    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

    <font size="+3"><strong>Time Series: Statistical Models</strong></font>


    # Autoregression


    Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works on the same principle as autocorrelation: in both cases, we're comparing data from one part of a series to another part of the same series. In other words, an AR model regresses a series on its own past values.
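As a minimal sketch of that idea (with made-up numbers, not the lesson's air-quality data), an AR(1) model is just a linear regression of a series on its own previous value:

```python
import numpy as np

# Toy series standing in for hourly readings
y = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.0])

# Fit y[t] = b0 + b1 * y[t-1] by ordinary least squares
X = np.column_stack([np.ones(len(y) - 1), y[:-1]])  # intercept column + lagged values
b = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# Predict the next time step from the last observation
next_pred = b[0] + b[1] * y[-1]
```

Later in the lesson, statsmodels does this fitting for us (with many more lags); the regression-on-the-past structure is the same.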

    ## Cleaning the Data

    Just like with linear regression, we'll start by bringing in some tools to help us along the way.

    [ ]:
    import warnings

    warnings.simplefilter(action="ignore", category=FutureWarning)
    Since we'll be working with the `"air-quality"` data again, we need to connect to the server, start our client, and grab the data we need.


    [ ]:
    from pymongo import MongoClient

    client = MongoClient(host="localhost", port=27017)
    <font size="+1">Practice</font>


    Just to make sure we're all on the same page, import all those libraries and get your database up and running. Remember that even though all the examples use the Site 3 data from the lagos collection, the practice sets should use Site 4 data from the lagos collection. Call your database lagos_prac.

    [ ]:
    lagos_prac = ...
    In order to get our data into a form we can use to build our model, we're going to need to transform it in several key ways. The first thing we need to do is to get the data we need, and save the results in a DataFrame. Since we're interested in predicting the changes in air quality over time, let's set the DataFrame's index to `"timestamp"`:


    [ ]:
    # `lagos` is the collection we connected to above
    results = lagos.find(
        {"metadata.site": 3, "metadata.measurement": "P2"},
    )
    df = pd.DataFrame(results).set_index("timestamp")
    <font size="+1">Practice</font>


    Try it yourself! Create a list called results_prac that pulls data from Site 4 in the lagos data, then save it in a DataFrame called df_prac with the index "timestamp".

    [ ]:
    ## Localizing the Timezone


    Because MongoDB stores all timestamps in UTC, we need to figure out a way to localize them. Having timestamps in UTC might be useful if we were trying to predict some kind of global trend, but since we're only interested in what's happening with the air in Lagos, we need to change the data from UTC to Africa/Lagos. Happily, pandas has a pair of tools to help us out: tz_localize and tz_convert. We use those methods to transform our data like this:

    [ ]:
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
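Here's a quick sketch of what those two methods do, using a couple of made-up readings in place of the real collection:

```python
import pandas as pd

# Two hypothetical readings with naive timestamps, as they come back from MongoDB (UTC)
idx = pd.to_datetime(["2018-10-01 00:00", "2018-10-01 01:00"])
df = pd.DataFrame({"P2": [12.0, 15.5]}, index=idx)

# First attach the UTC timezone, then convert to Lagos time (UTC+01:00)
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
print(df.index[0])  # 2018-10-01 01:00:00+01:00
```

Note that `tz_localize` doesn't move the clock at all; it just attaches a timezone to naive timestamps. It's `tz_convert` that shifts them, here by Lagos's one-hour offset.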
    ## Resampling Data


    The most important kind of data in our time-series model is the data that deals with time. Our "timestamp" data tells us when each reading was taken, but in order to create a good predictive model, we need the readings to happen at regular intervals. Our data doesn't do that, so we need to figure out a way to change it so that it does. The resample method does that for us.

    Let's resample our data to create 1-hour reading intervals by aggregating using the mean:

    [ ]:
    df = df["P2"].resample("1H").mean().fillna(method="ffill").to_frame()
    Notice the second half of the code:


    fillna(method="ffill").to_frame()
    

    That tells the model to forward-fill any empty cells with imputed data. Forward-filling replaces each missing value with the most recent cell before it that actually has data in it. This helps to keep the imputed data in line with the rest of the dataset.
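Here's the same pattern on a small made-up series, so you can watch an empty hourly bin get forward-filled:

```python
import pandas as pd

# Irregular readings with no observations in the 02:00 hour
idx = pd.to_datetime(
    ["2018-10-01 00:05", "2018-10-01 00:45", "2018-10-01 01:10", "2018-10-01 03:20"]
)
readings = pd.Series([10.0, 14.0, 12.0, 20.0], index=idx, name="P2")

# Aggregate to 1-hour bins with the mean; the empty 02:00 bin becomes NaN,
# then forward-filling copies the last valid reading (12.0) into it
hourly = readings.resample("1H").mean().ffill()
print(hourly.tolist())  # [12.0, 12.0, 12.0, 20.0]
```

(`.ffill()` is the modern spelling of `fillna(method="ffill")`; the two behave the same way here.)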

    ## Adding a Lag

    We've spent some time elsewhere thinking about how two sets of data — apartment price and location, for example — compare to each other, but we haven't had any reason to consider how a dataset might compare to itself. If we're predicting the future, we want to know how good our prediction will be, so it might be useful to build some of that accountability into our model. To do that, we need to add a lag.

    Lagging data means that we're adding a delay. In this case, we're going to allow the model to test itself out by comparing its predictions with what actually happened an hour before. If the prediction and the reality are close, then it's a good model; if they aren't, then the model isn't a very good one.

    So, let's add a one-hour lag to our dataset:

    [ ]:
    # In `shift(1)`, the `1` is the lagged interval.
    df["P2.L1"] = df["P2"].shift(1)
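On a toy DataFrame, you can see exactly what `shift(1)` does: each row receives the previous row's reading, and the first row becomes `NaN` (which is why we drop null values next):

```python
import pandas as pd

# Toy hourly readings
df = pd.DataFrame({"P2": [5.0, 7.0, 6.0, 9.0]})

# `shift(1)` moves every value down one row, creating a one-step lag
df["P2.L1"] = df["P2"].shift(1)
print(df["P2.L1"].tolist())  # [nan, 5.0, 7.0, 6.0]
```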
    Finally, let's drop our null values:


    [ ]:
    y = df["P2"].resample("1H").mean().fillna(method="ffill")
    <font size="+1">Practice</font>


    Try it yourself! Clean the Site 2 data from lagos, and save it as a Series called y_prac.

    [ ]:
    df_prac["P2.L1"] = ...
    ## Exploring the Data


    ### Time Series Line Plot



    ### Creating an ACF Plot


    Let's make an ACF plot using our y Series.

    [ ]:
    fig1, ax = plt.subplots(figsize=(15, 6))
    plot_acf(y, ax=ax);
    Each of the dots on our plot represents a correlation coefficient. The first data point in the top left of the graph tells us that at time-step 0, the correlation coefficient was 1, meaning that there was a perfect correlation. That makes sense, because you can't lag from time-step 0, so the coefficient can't be anything other than 1. But, starting at hour 1, the coefficient drops precipitously, and we see our autocorrelation coefficients slowly decay over time. As our lag recedes further into the past, the correlations break down; a prediction you made five hours ago about what's happening right now is going to be a lot more reliable than a prediction you made 96 hours ago.


    The light blue shape across the bottom of the graph represents the confidence interval, or the extent to which we can be sure that our estimated correlations reflect the correlations that exist in reality. By default, this is set to 95%. Data points which fall either above or below the shape are likely not due to chance, and those which fall inside the shape are likely due to chance. It looks like all our data is the result of some kind of effect, so we're good to go.
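Each dot on the plot is a number you could compute yourself: pandas' `autocorr` method returns the same coefficients one lag at a time. Here's a sketch with a toy series standing in for our readings:

```python
import pandas as pd

# Toy stand-in for the hourly P2 series
y = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.0, 15.0])

acf_lag0 = y.autocorr(lag=0)  # a series always correlates perfectly with itself
acf_lag1 = y.autocorr(lag=1)  # correlation with the series shifted one hour

print(round(acf_lag0, 2))  # 1.0
```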

    <font size="+1">Practice</font>

    Try it yourself! Make an ACF plot called fig2 using your y_prac Series.

    [ ]:
    fig2, ax = ...
    ### Creating a PACF Plot


    Let's make a PACF plot using our y Series.

    [ ]:
    fig1, ax = plt.subplots(figsize=(15, 6))
    plot_pacf(y, ax=ax);
    Aha! This looks very different. There are two things to notice here:


    First, we now have lots of data points that we can be relatively certain aren't due to chance, but we also have lots of data points inside the blue shape at the bottom, indicating that some of our data points are indeed due to chance. That's not necessarily a problem, but it's something useful to keep in mind.

    Second, recognize that even though the amplitude of the points on our graph has been significantly reduced, the trend has remained essentially the same: Strong positive correlations at the beginning, with the effect decaying over time. We would expect to see that, because the farther out into the future our predictions go, the less accurate they become.

    <font size="+1">Practice</font>

    Try it yourself! Make a PACF plot using your y_prac Series.

    [ ]:
    fig2, ax = ...
    ## Working with Rolling Windows


    A rolling window is an important concept in time series analysis. We first define a window size (seven days, three months, and so on), then calculate a statistic over each window as it moves sequentially through the series. For example, suppose we want to calculate a three-month rolling sum for the time series below:

    Month      Sales
    2022-01    10
    2022-02    20
    2022-03    25
    2022-04    15
    2022-05    20
    2022-06    30

    The three-month rolling sum would be:

    Rolling months     Rolling sum of sales
    2022-01, 02, 03    55
    2022-02, 03, 04    60
    2022-03, 04, 05    60
    2022-04, 05, 06    65
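As a sketch, the rolling sum above can be reproduced with pandas' `rolling` method:

```python
import pandas as pd

# The monthly sales figures from the table above
sales = pd.Series(
    [10, 20, 25, 15, 20, 30],
    index=pd.period_range("2022-01", periods=6, freq="M"),
)

# Each value sums the current month and the two before it;
# the first two windows are incomplete, so we drop their NaNs
rolling_sum = sales.rolling(window=3).sum().dropna()
print(rolling_sum.tolist())  # [55.0, 60.0, 60.0, 65.0]
```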
    Rolling window statistics are very helpful for smoothing noisy data when making time series predictions. Let's see it with an example. Since we're interested in making predictions about the air quality in Lagos, it would be helpful to look at the rolling average of the PM 2.5 readings with a line plot. To keep things manageable, we'll set our window size to one week.

    [ ]:
    # `168` is the number of hours in a week.
    y.rolling(168).mean().plot();
    Even though there are lots of peaks and valleys here, we're starting to see an emerging trend.


    We can make the same graph using pandas, like this:

    <font size="+1">Practice</font>

    Try it yourself! Make a line plot that shows the weekly rolling average of the P2 values in the Site 2 dataset.

    [ ]:
    Besides rolling sums and rolling averages, rolling windows can be applied to many other statistics, depending on the problem you're facing. For example, when we use a **GARCH** model to analyze stock prices, we can use a rolling window to calculate the standard deviation.


    ### Splitting the Data in pandas


    The last thing to do in our data exploration is to split our data into training and test sets. For linear regression, we used an 80/20 split, where 80% of the data was our training set and 20% was our test set. This time, we're going to expand the training set to 95% and decrease the test set to 5%, to bring it into line with statsmodels' default confidence interval. This is important, because we'll need to use as much training data as we can if our model is going to accurately predict what's going to come next.

    [ ]:
    cutoff_test1 = int(len(y) * 0.95)
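Here's the same split on a toy Series, just to show the mechanics:

```python
import pandas as pd

# A stand-in series of 100 observations
y = pd.Series(range(100))

# 95% of observations for training, the final 5% for testing;
# time series are split by position, never shuffled
cutoff_test = int(len(y) * 0.95)
y_train, y_test = y.iloc[:cutoff_test], y.iloc[cutoff_test:]
print(len(y_train), len(y_test))  # 95 5
```

Keeping the split positional matters here: shuffling, as we might for other models, would let the training set "see" the future.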
    <font size="+1">Practice</font>


    Try it yourself! Create a cutoff called cutoff_test2 and split the y_prac Series into training and test sets, making sure to set the cutoff to 0.95.

    [ ]:
    cutoff_test2 = ...
    ## Building the Model


    ### Baseline


    First, let's calculate the baseline MAE for our model.

    [ ]:
    y_pred_baseline = [y_train.mean()] * len(y_train)
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
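To see what `mean_absolute_error` computes under the hood, here's the baseline worked by hand on a toy series:

```python
import pandas as pd

# Toy training data
y_train = pd.Series([10.0, 14.0, 12.0, 20.0])

# The baseline always predicts the training mean (14.0 here)
y_pred_baseline = [y_train.mean()] * len(y_train)

# MAE is just the average absolute error: (4 + 0 + 2 + 6) / 4
mae_baseline = (y_train - y_pred_baseline).abs().mean()
print(mae_baseline)  # 3.0
```

Any model we build has to beat this number to be worth keeping.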
    <font size="+1">Practice</font>


    Try it yourself! Calculate the baseline mean and MAE for the y_prac Series.

    [ ]:
    y_prac_pred_baseline = ...
    ### Iterating


    Before we can go any further, we need to instantiate an autoregression model based on our y training data. We'll call the model model.

    [ ]:
    from statsmodels.tsa.ar_model import AutoReg

    model = AutoReg(y_train, lags=24, old_names=False).fit()
    Notice that, unlike our linear regression model which we built using scikit-learn, we're combining instantiation and fitting into one step; statsmodels includes that ability in its `AutoReg` method.


    <font size="+1">Practice</font>

    Try it yourself! Create and fit an autoregression model called model_prac.

    [ ]:
    model_prac = ...
    Autoregression models need us to generate **in-sample predictions** in order to calculate the MAE of our training data. In-sample predictions use data that's already part of our sample. That's to distinguish it from out-of-sample predictions, which we'll talk about a bit later. The statsmodels library includes a method called [`predict`](https://www.statsmodels.org/stable/examples/notebooks/generated/predict.html) that can help us here. Above, the `AutoReg` method includes this line:


    old_names=False
    

    Despite appearances, `old_names=False` has nothing to do with the predictions themselves; it just tells statsmodels to use its newer naming convention for the model's parameters instead of the deprecated one.

    Here's how to generate in-sample predictions:

    [ ]:
    y_pred = model.predict().dropna()
    Once we've done that, we can calculate the MAE of the predictions in our training set.


    [ ]:
    training_mae = mean_absolute_error(y_train.loc[y_pred.index], y_pred)
    <font size="+1">Practice</font>


    Try it yourself! Generate in-sample predictions using y_prac, and find the MAE for your y_prac training data. Print the result.

    [ ]:
    y_prac_pred = ...
    training_mae_prac = mean_absolute_error(
        y_prac_train.loc[y_prac_pred.index], y_prac_pred
    )
    ### Residuals


    We're going to use our model's residuals to make some visualizations, but first, we need to calculate what those residuals are.

    [ ]:
    y_train_resid = y_train - y_pred
    Now we can make a line plot of our model's residuals.


    [ ]:
    fig1, ax = plt.subplots(figsize=(15, 6))
    y_train_resid.plot(ax=ax);
    The ideal residual plot has a random set of datapoints spread evenly on both sides of the line. The plot we just made actually looks pretty good; there are some significant outliers, but, on the whole, the bars describe an even band of values, which is what we're looking for.


    <font size="+1">Practice</font>


    Try it yourself! Calculate the residuals for y_prac and visualize them on a line plot called fig2.

    [ ]:
    y_prac_train_resid = ...
    Let's also take a look at a histogram of the residuals to help us see how they're distributed.


    [ ]:
    y_train_resid.hist();
    Remember, when we make histograms, we're trying to answer two questions: 


    1. Is it a normal distribution?
    2. Are there any outliers?

    For our histogram, that middle bar is pretty tall, but the shape described by all the bars looks like a normal distribution (albeit a stretched one), so the answer to the first question is "yes." Outliers are values that fall beyond the shape of a normal distribution, and it doesn't look like we have any of those, so the answer to the second question is "no." Those are the answers we're looking for, so let's move on to the next step.

    ### ACF Plots


    We're going to make an ACF plot to see how much variation there is in the dataset.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_acf(y_train_resid.dropna(), ax=ax);
    At first, this might seem wrong, but we're actually looking for a mostly-flat graph here. This is an indication that our model describes all the **seasonality**, or regular changes, in our data. In other words, this graph is exactly what we're looking for.


    <font size="+1">Practice</font>

    Try it yourself! Make a histogram and an ACF plot of the y_prac data.

    [ ]:
    [ ]:
    ## Evaluating the Model


    Now that we've built an autoregression model that seems to be working pretty well, it's time to evaluate it. We've already established that the model works well when compared to itself, but what about how well it works when we start looking outside our original dataset?

    ### Out-of-Sample Predictions


    To look outside the data, we need to create a new set of predictions. The process is very similar to the way we made our in-sample predictions: we're still using predict, but now we pass it the range of the test data instead of the training data.

    [ ]:
    y_pred_test = model.predict(y_test.index.min(), y_test.index.max())
    Now that we have a prediction, we can calculate the MAE of our out-of-sample data.


    [ ]:
    test_mae = mean_absolute_error(y_test, y_pred_test)
    <font size="+1">Practice</font>


    Try it yourself! Generate out-of-sample predictions using your y_prac data and model_prac, calculate the MAE, and print the result.

    [ ]:
    y_prac_pred_test = model_prac.predict(
        y_prac_test.index.min(), y_prac_test.index.max()
    )
    Now that we have some out-of-sample predictions, we can compare it to our in-sample predictions using a line plot. The first step there is to create a new DataFrame called `test1_predictions` with two columns: one for the `y_test` data (the true data) and one for the `y_pred` (the predicted data). It's always a good idea to print the first five rows of a new DataFrame, just to make sure it looks the way it should.


    [ ]:
    test1_predictions = pd.DataFrame(
        {"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index
    )
    test1_predictions.head()
    That looks correct, so we can move on to our line plot.


    [ ]:
    fig = px.line(test1_predictions, labels={"value": "P2"})
    fig.show()
    This looks kind of strange, but it's actually exactly what we would expect to see. At the beginning, the `y_pred` data has a fair amount of predictive power, but, as time goes on, the predictions become less and less accurate. It's kind of like what happened with our ACF plots, only in reverse. Last time, the model lost its predictive power as the lag increased. Here, the model loses its predictive power as the horizon — how far away from the present your predictions are — increases.  But don't worry! We'll fix it in a second.


    <font size="+1">Practice</font>

    In the meantime, try it yourself! Make a DataFrame with columns for y_prac_test and y_prac_pred, and print the result. Then, make a line plot that shows the relationship between the two variables.

    [ ]:
    [ ]:
    ### Walk-forward Validation


    Our predictions lose power over time because the model gets farther and farther away from its starting point. But what if we could move that starting point forward with the model? That's what walk-forward validation does. In a walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past. Let's see how it works.

    [ ]:
    # Then, we define a variable that takes into account what's happened in the past
    You'll notice that we're using the same `AutoReg` method we used before, only this time, we're using the `y_train` data. Also like before, the `24` is telling the model how many hours it should pay attention to. If you change that number, the MAE will change too.

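The mechanics of walk-forward validation are easier to see with a toy model. In this sketch, the `AutoReg` refit is replaced by a naive "predict the last observed value" rule (an assumption made purely to keep the example small); the loop structure is the same:

```python
import pandas as pd

# Toy series split into training and test sets
y = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 13.5])
y_train, y_test = y.iloc[:4], y.iloc[4:]

history = y_train.copy()
predictions = []
for timestamp in y_test.index:
    pred = history.iloc[-1]  # stand-in for refitting a model on `history`
    predictions.append(pred)
    # Walk forward: fold the newly observed value into the history
    history.loc[timestamp] = y_test.loc[timestamp]

y_pred_wfv = pd.Series(predictions, index=y_test.index)
print(y_pred_wfv.tolist())  # [13.0, 14.0]
```

In the real lesson, the line marked as a stand-in refits `AutoReg` on `history` at every step, which is why the loop is slow enough to warrant `%%capture`.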

    Speaking of the MAE, let's calculate it and see what we've got.

    [ ]:
    test1_mae = mean_absolute_error(y_test, y_pred_wfv)
    print("Test MAE 1 (walk forward validation):", round(test1_mae, 2))
    <font size="+1">Practice</font>


    Try it yourself! Perform a walk-forward validation of your model using the y_prac_train data. Then, calculate the MAE and print the result. Note that because we're using %%capture in the validation cell, you'll need to create a new cell for your MAE calculation.

    [ ]:
    y_prac_pred_wfv = ...
    [ ]:
    test2_mae = ...
    ## Communicating the Results


    In machine learning, the model's parameters are the parts of the model that are learned from the training data. There are also hyperparameters, which we'll discuss in the next module. For now, just know that parameters come from inside the model, and hyperparameters are specified outside the model.

    So, let's print the parameters of our validated model and see what it looks like.

    [ ]:
    print(model.params)
    That looks pretty good, but showing it in a line plot would be much better.


    [ ]:
    test2_predictions = pd.DataFrame(
        {"y_test": y_test, "y_pred": y_pred_wfv}, index=y_test.index
    )
    fig = px.line(test2_predictions, labels={"value": "P2"})
    fig.show()
    That looks much better! Now our predictions are actually tracking the `test` data, just like they did in the linear regression model.


    <font size="+1">Practice</font>

    Try it yourself! Access the parameters of model_prac, put y_prac_test and y_prac_pred_wfv into the test2_predictions DataFrame, and create a line plot using plotly express.

    [ ]:
    [ ]:
    # ARMA Models & Hyperparameters


    **ARMA** stands for Auto Regressive Moving Average, and it's a special kind of **time-series** analysis. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes — economists call them *exogenous shocks* — can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might change if there was a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.


    Regardless of the size of the shock, ARMA models can still predict the future. All we need to make that work is data.

    ## Cleaning the Data


    As always, we need to import all the tools we'll need to make our model.

    [ ]:
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    And then we need to get our database client up and running.


    [ ]:
    from pymongo import MongoClient

    client = MongoClient(host="localhost", port=27017)
    Then, we need to clean our data. All the examples will use data from Site 3; all the practice sets will use Site 2. If you need a refresher on how all those methods work, refer back to the Autoregression notebook.


    [ ]:
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
    <font size="+1">Practice</font>


    Try it yourself! Get your client up and running and call your database db_prac. Create a variable called results_prac, and read in a collection called lagos_prac using data from Site 2. Save it as a Series called y_prac.

    [ ]:
    df_prac["P2.L1"] = ...
    ## Exploring the Data


    Just like we did with AR, we'll start by exploring the data. Let's make a histogram.

    [ ]:
    y.hist();
    <font size="+1">Practice</font>


    Try it yourself! Make a histogram using y_prac.

    [ ]:
    This is what the data looks like when our sample is 1-hour intervals, but we might want to be able to quickly change our sample to other intervals of time. First, we'll create a function called `wrangle`, and then add an **argument**. In Python, arguments are the values we pass into a function to tell it what to work with. This function already has an argument called `collection`, so we'll need to add another to make resampling work. We'll call that argument `resamp_pd`.


    [ ]:
    # Here's where the new argument goes. We're setting the default value to `"1H"`.
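The full `wrangle` function depends on the query code from earlier in the lesson, so here's a hedged sketch of just the new piece: a function that takes a Series of readings plus a `resamp_pd` argument defaulting to `"1H"` (the readings here are made up, standing in for the real collection):

```python
import pandas as pd

def wrangle(readings, resamp_pd="1H"):
    # `resamp_pd` is the new argument: the resampling period, defaulting to one hour
    return readings.resample(resamp_pd).mean().ffill()

# Three made-up readings over a few hours
idx = pd.to_datetime(["2018-10-01 00:30", "2018-10-01 01:45", "2018-10-01 03:10"])
readings = pd.Series([10.0, 20.0, 30.0], index=idx, name="P2")

hourly = wrangle(readings)  # uses the default resamp_pd="1H"
print(hourly.tolist())  # [10.0, 20.0, 20.0, 30.0]
```

Because `resamp_pd` has a default value, existing calls to the function keep working unchanged, and callers can override the period only when they need to.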
    Now let's change `"1H"` to `"1D"` and see what happens.


    [ ]:
    y = wrangle(lagos, resamp_pd="1D")
    As you can see on the left side of the table, the samples are now at one day intervals, which is exactly what we wanted!


    Let's make a new histogram to see if changing the sampling interval made a difference in the data.

    [ ]:
    y.hist();
    This looks pretty different! It's always nice to have a diversified dataset.


    <font size="+1">Practice</font>

    Try it yourself! Define a function called wrangle_prac, run it, and print the results of y_prac. Then, create a new histogram from y_prac.

    [ ]:
    print(y_prac)
[ ]:
Like with our AR model, we need to create ACF and PACF plots to see what's happening with the correlation coefficients.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))
And now let's make a PACF plot.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))
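Before reading the plots, it can help to remember what the ACF actually computes: the lag-$k$ autocorrelation is just the correlation between the series and a copy of itself shifted by $k$ steps. A quick sketch with numpy (toy data, not the air-quality readings):

```python
import numpy as np

def autocorr(x, lag):
    # Correlation between the series and a lag-shifted copy of itself.
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# A strongly autocorrelated toy series: a slow ramp plus small noise.
rng = np.random.default_rng(0)
x = np.arange(100) + rng.normal(0, 1, 100)
print(autocorr(x, 1))  # close to 1 for a trending series
```

The `plot_acf` and `plot_pacf` helpers compute these coefficients for many lags at once and draw the confidence band for you.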
<font size="+1">Practice</font>

Try it yourself! Make an ACF and a PACF plot using your y_prac data.

[ ]:
[ ]:
## Splitting the Data

    In our AR model, we split our data based on the number of observations we wanted to investigate. This time, we're going to split our data based on the date, using just the readings from October 2018. So, just like we did before, we'll create a training set using y, but instead of using percentages to split the data, we'll use dates.

    [ ]:
    # Notice that the date format is `YYYY-MM-DD`
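With a DatetimeIndex, pandas lets you slice a Series directly with date strings, and both endpoints are inclusive. A minimal sketch of a date-based split on toy data (the cutoff dates are illustrative):

```python
import numpy as np
import pandas as pd

# A year of toy daily readings with a DatetimeIndex.
idx = pd.date_range("2018-01-01", "2018-12-31", freq="D")
y_demo = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Slice by date strings; both endpoints are inclusive.
y_train_demo = y_demo["2018-10-01":"2018-10-31"]
print(len(y_train_demo))  # 31 days of October readings
```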
<font size="+1">Practice</font>

    Try it yourself! Create a training dataset called y_prac_train based on November 2017. (Hint: there are 30 days in November.)

    [ ]:
    y_prac_train = ...
## Building the Model

### Baseline

    The first thing we need to do is calculate the MAE for our new model:

    [ ]:
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
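As a reminder of what this line computes: the baseline prediction is simply the training mean repeated for every observation, and the MAE is the average absolute distance between predictions and actual values. A small hand-computed sketch with toy numbers:

```python
import numpy as np

y_train_demo = np.array([4.0, 6.0, 8.0, 10.0])

# Baseline: predict the training mean for every observation.
baseline = np.full_like(y_train_demo, y_train_demo.mean())

# MAE: mean of the absolute differences.
mae = np.abs(y_train_demo - baseline).mean()
print(baseline[0], mae)  # 7.0 2.0
```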
<font size="+1">Practice</font>

    Try it yourself! Calculate the mean and MAE for the y_prac Series, and print the results.

    [ ]:
    y_prac_pred_baseline = ...
### Iterating

    So far, the only difference between our old AR model and the new ARMA model we're building is that the new model's data is based on the date rather than on the length of the variable. But the difference between AR and ARMA is the addition of hyperparameters.

### Hyperparameters

Let's set our `p` values to include values from 0 to 25, moving in steps of 8:

    [ ]:
    p_params = range(0, 25, 8)
And let's set our `q` values to include values from 0 to 3, moving in steps of 1:

    [ ]:
    q_params = range(0, 3)
<font size="+1">Practice</font>

Using `p_params_prac`, set the `p` value to include values from 1 to 4, moving in steps of 1. Then, using `q_params_prac`, set the `q` value to include values from 0 to 3, moving in steps of 1.

    [ ]:
    p_params_prac = ...
In order to tell the model to keep going through all the possible combinations, we'll add in a pair of `for` loops. (If you need a refresher on `for` loops, refer to Notebook 001.)

    [ ]:
            # Here's where we tell the model how we want it to deal with time
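The shape of the grid search is easier to see stripped down to just the loops. This sketch uses a stand-in scoring function where the real notebook would fit an ARIMA model at each step; the `score` function here is purely illustrative:

```python
p_params = range(0, 25, 8)   # 0, 8, 16, 24
q_params = range(0, 3)       # 0, 1, 2

def score(p, q):
    # Stand-in for fitting a model and computing its MAE.
    return abs(p - 8) + q

# Nested loops visit every (p, q) combination, one MAE per pair.
maes = {}
for p in p_params:
    maes[p] = []
    for q in q_params:
        maes[p].append(score(p, q))

print(maes[8][0])  # best stand-in score: 0
```

A dictionary keyed by `p`, with one list entry per `q`, is exactly the shape that converts cleanly into a DataFrame later.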
<font size="+1">Practice</font>

Try it yourself! Create an ARMA model called `model_prac2` using your training and test data, then print the results and append the MAE to a dictionary called `maes_prac`.

[ ]:
Now that we have a working ARMA model, let's turn the output into a DataFrame so we can see what happened.

    [ ]:
    mae_grid = pd.DataFrame(maes)
And let's visualize the DataFrame in a heatmap. (If you need a refresher on how to create a heatmap in seaborn, refer to Notebook 008.)

    [ ]:
    plt.title("Grid Search (Criterion: MAE)");
<font size="+1">Practice</font>

Try it yourself! Read the output of your ARMA model into a DataFrame called `mae_grid_prac`, and visualize it in a heatmap.

    [ ]:
    mae_grid_prac = ...
It looks like our MAE values are in the right place, but let's try some other ways to explore our new model using the `plot_diagnostics` method.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 12))
As usual, we have quite a lot to sift through here. The first graph shows our model's residuals. Ideally, we'd like these to be as close to zero as possible, and this graph tells us that, for the most part, we have a good model.

The next graph over shows us another version of the same thing. The histogram is similar to the one we made before, but there's a pair of lines superimposed. These lines indicate the kernel density, which is another way of saying a smoothed-out version of the blue histogram bars. The green line represents a normal distribution, and is included here just to give you something to compare to the values from your model. The orange line represents the smoothed-out version of the results of your model. Our model is actually pretty close to a normal distribution, so that's good!

    The Q-Q plot on the bottom left is yet another way to visualize the same thing. Here, the red line is showing us a perfect 1:1 correlation between our variables, and the wavy blue line is showing us what we actually have. Again, our model is pretty close to the red line, so it's looking good.

    And finally, we have a correlogram, which might look familiar; it's the same kind of plot as the ACF and PACF plots from our AR models.

    Practice

    Try it yourself! Use plot_diagnostics to examine the residuals from model_prac.

[ ]:
## Communicating the Results

    Now that we have an ARMA model that seems to be working well, it's time to communicate the results of our analysis in a line graph. Let's create a graph that shows the relationship between our training and predicting data. (For a refresher on how to do this and what it means, refer to the AR notebook.)

    [ ]:
        {"y_train": y_train, "y_pred": y_train_pred}, index=y_train.index
<font size="+1">Practice</font>

    Try it yourself! Create a line plot that compares the y_prac_train and y_prac_pred values.

[ ]:
# ARCH and GARCH Models

**ARCH** stands for autoregressive conditionally heteroscedastic, a model of the variance of a time series. An **ARCH** model assumes that the variance at time $t$ depends on the **past squared observations**. A **GARCH** (generalized autoregressive conditionally heteroscedastic) model uses both the **past squared observations** and **past variances** to model the variance at time $t$. **ARCH** and **GARCH** models are widely used in finance and econometric time series analysis. For more details, check out these YouTube videos.

    [ ]:
    YouTubeVideo("Li95a2biFCU")
    [ ]:
    YouTubeVideo("inoBpq1UEn4")
Let's see an example of using a **GARCH** model to forecast Apple stock price volatility. We first import the data.

    [ ]:
    df = pd.read_csv("./data/AAPL.csv", parse_dates=["date"], index_col="date")
The dataset ranges from 1999 to 2022. For simplicity, we'll only take five years of data.

    [ ]:
    df = df["2015-10-13":]
## Calculating Returns

The next step is to calculate the volatility of the stock close prices. Here we can use the `pct_change` method from pandas to calculate the daily percentage change, then multiply the result by 100 to get the returns:

    [ ]:
    df["return"] = df["close"].pct_change() * 100
We can then take out the return column as our training data. Note that the first observation is missing, since there is no past value to compare it to. We need to drop it:

    [ ]:
    AAPL_return = df["return"].dropna()
We can check the histogram to see the distribution of the returns over the past five years:

    [ ]:
    plt.title("Distribution of AAPL Daily Returns");
There's a negative outlier in this date range. With the `idxmin` function, we find that it occurred on 31 August 2020, when the stock price dropped sharply due to a [stock split](https://www.cnbc.com/2020/08/31/history-of-apple-stock-splits-says-dont-rush-in-to-buy-cheaper-shares.html).

    [ ]:
    AAPL_return.idxmin(), AAPL_return.min()
We can also check the standard deviation of the whole dataset with the pandas `std()` method:

    [ ]:
    AAPL_return.std()
To see these statistics more clearly, we can use some time series plots. In addition to the Apple stock return plot, we can also plot the rolling volatility of the returns to smooth out the noise, and see how return and volatility are associated with each other. First, we calculate a 50-day rolling standard deviation series for the Apple stock returns:

    [ ]:
    AAPL_return_rolling_50d_volatility = AAPL_return.rolling(window=50).std().dropna()
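A rolling standard deviation slides a fixed-length window along the series and computes the std of each window; the first `window - 1` positions come back `NaN` because the window isn't full yet, which is why we `dropna()`. The same mechanics on a toy series with a window of 3:

```python
import pandas as pd

r = pd.Series([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# 3-observation rolling standard deviation; the first two values are
# NaN because the window isn't full yet, so we drop them.
vol = r.rolling(window=3).std().dropna()
print(vol.tolist())  # flat windows give 0.0, mixed windows a positive std
```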
We can plot these two series:

    [ ]:
        ax=ax, label="50d rolling volatility", linewidth=3
Here we can see that volatility goes up when the returns change drastically, either up or down. For instance, when the return dropped enormously in August 2020, volatility also increased a lot. However, this plot reveals a problem. We want to use returns to see if high volatility on one day is associated with high volatility on the following day. But high volatility is caused by large changes in returns, which can be either positive or negative. How can we assess negative and positive numbers together without them canceling each other out? The common solution used in building ARCH and GARCH models is to square the returns. Let's plot the squared returns:

    [ ]:
    (AAPL_return**2).plot(ax=ax, label="daily return")
Now it's much easier to see clusters of high and low volatility, and we can build a GARCH model. To build a GARCH model, we need to figure out both the `p` and `q` parameters. The `p` parameter handles correlations at prior time steps, and the `q` parameter deals with prior variances, like shocks. It also uses the notion of lag. To see how many lags we should have in our model, we should create an ACF and a PACF plot, but using the squared returns.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))
    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))
Both the ACF and PACF graphs show that one lag is enough to build the model.

## Building the Model

    [ ]:
    model = arch_model(AAPL_return, p=1, q=1, rescale=False).fit(disp=0)
## Common Metrics

The model summary provides a lot of information about the model, like the trained parameters, model performance, etc. **AIC** and **BIC** are important measurements of model performance. **AIC** stands for Akaike Information Criterion; it measures the goodness of fit of an estimated statistical model. **BIC** is the Bayesian Information Criterion, which selects a model from a finite set of models. When fitting models, it is possible to increase apparent performance by adding more parameters, which can result in overfitting. The BIC addresses this problem by introducing a penalty term for the number of parameters in the model, and the penalty term is larger in BIC than in AIC. Though BIC is typically higher than AIC, for both measures a lower value indicates a better model.
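For reference, the standard definitions, with $k$ the number of estimated parameters, $\hat{L}$ the maximized likelihood, and $n$ the number of observations, are:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}
```

Both reward fit (a larger $\hat{L}$) and penalize extra parameters; BIC's penalty $k\ln(n)$ grows with the sample size, which is why it is the stricter of the two for $n > e^2 \approx 7$.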

## Standardized Residuals

After fitting the model to the data, we can also get the **residuals** to see how the model performs. The residual at time $t$ is the observed return at time $t$ minus the model's estimated mean return at time $t$. For models that work on the observations directly, like stock prices, the assumption is that the residuals are noise that follows a normal distribution. In a GARCH model, since we are modeling the variance of the returns, the raw residuals will not follow a normal distribution. So we need to use **standardized residuals** instead of residuals to check normality. The standardized residual at time $t$ is the residual at time $t$ divided by the square root of the model's estimated variance at time $t$. Let's check the plot of the standardized residuals across the time series:

    [ ]:
    model.std_resid.plot(ax=ax, label="Standardized Residuals")
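The standardization itself is a one-liner. A sketch with toy values standing in for the model's outputs (the numbers are invented for illustration):

```python
import numpy as np

# Toy stand-ins for model outputs: residuals and the model's
# estimated conditional variance at each time step.
resid = np.array([1.0, -2.0, 3.0])
variance = np.array([4.0, 4.0, 9.0])

# Standardized residual: residual / sqrt(estimated variance).
std_resid = resid / np.sqrt(variance)
print(std_resid)  # [ 0.5 -1.   1. ]
```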
The standardized residuals are moving around 0, except for the outlier event from August 2020. Let's check the normality with a histogram:

    [ ]:
    plt.title("Distribution of Standardized Residuals");
If we exclude the outlier, the other standardized residuals do follow a normal distribution with mean 0.

## Evaluation

We can further evaluate the model by comparing its forecast with a subset of the observed returns to see whether the model has successfully captured the volatility. We can first check the model's conditional volatility and its confidence interval throughout the dataset:

    [ ]:
    (-2 * model.conditional_volatility.rename("")).plot(ax=ax, color="C1")
We can then select the test set of the data and do a walk-forward validation on the GARCH model:

    [ ]:
        next_pred = model.forecast(horizon=1, reindex=False).variance.values[0][0] ** 0.5
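Walk-forward validation makes one forecast at a time, then reveals the true observation before forecasting the next step. A hedged sketch of just that loop, using a naive stand-in "model" (the standard deviation of all data seen so far) in place of refitting a real GARCH model:

```python
import numpy as np

returns = [1.0, -1.0, 2.0, -2.0, 1.5, -1.5, 2.5, -2.5]
n_test = 3
history = list(returns[:-n_test])
predictions = []

for t in range(n_test):
    # Stand-in one-step volatility forecast: std of everything seen so far.
    predictions.append(float(np.std(history)))
    # Walk forward: reveal the next true observation to the "model".
    history.append(returns[len(returns) - n_test + t])

print(len(predictions))  # one forecast per test observation
```

In the real version, each iteration refits `arch_model` on `history` and takes the square root of the one-step forecast variance, exactly as in the cell above.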
We can plot the predicted volatility versus return in this test set, and see that the model has captured most of the variance here:

    [ ]:
    (2 * y_test_wfv).plot(ax=ax, c="C1", label="2 SD Predicted Volatility")
## Forecasting

The last step is to make forecasts with our trained model. Let's make a one-day forecast first to see how GARCH model forecasting works:

    [ ]:
    one_day_forecast = model.forecast(horizon=1, reindex=False).variance
There are two things we need to keep in mind here. First, the model forecasts the predicted variance, not the standard deviation (volatility), so we'll need to take the square root of the value. Second, the prediction's index is in the `"h.1"` format, counting steps ahead of the last training date, rather than the DatetimeIndex we want. So we need to format the forecasts to be more readable.

We first generate five days of predictions with our trained GARCH model:

    [ ]:
    prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])
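`pd.bdate_range` builds a business-day index, skipping weekends, which is what we want for trading-day forecasts. A small sketch (the start date is illustrative, not the dataset's actual last training date):

```python
import pandas as pd

# Business-day index starting from an illustrative date.
# 2020-10-09 is a Friday, so the next entry jumps to Monday.
start = pd.Timestamp("2020-10-09")
prediction_dates = pd.bdate_range(start=start, periods=5)
print(prediction_dates)
```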
Then we define a function to clean the predictions to be more readable:

    [ ]:
    prediction_formatted = pd.Series(data, index=prediction_index)
# References & Further Reading

    • More on ARMA models
• Even more on ARMA models
    • Information on p and q parameters
    • A primer on autoregression
    • More information on ACF plots
    • A primer on ACF and PACF
    • Background on residuals
    • More on walk-forward validation
    • Reading on parameters and hyperparameters
    ---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


<font size="+3"><strong>3.1. Wrangling Data with MongoDB 🇰🇪</strong></font>

    [1]:
    from IPython.display import VimeoVideo
    [2]:
    VimeoVideo("665412094", h="8334dfab2e", width=600)
    [2]:
    [3]:
    VimeoVideo("665412135", h="dcff7ab83a", width=600)
    [3]:
**Task 3.1.1:** Instantiate a `PrettyPrinter`, and assign it to the variable `pp`.

    • Construct a PrettyPrinter instance in pprint.
    [4]:
pp = PrettyPrinter(indent=2)
# Prepare Data

## Connect

    [5]:
    VimeoVideo("665412155", h="1ca0dd03d0", width=600)
    [5]:
**Task 3.1.2:** Create a client that connects to the database running at `localhost` on port `27017`.

    • What's a database client?
    • What's a database server?
    • Create a client object for a MongoDB instance.
    [6]:
client = MongoClient(host="localhost", port=27017)
## Explore

    [7]:
    VimeoVideo("665412176", h="6fea7c6346", width=600)
    [7]:
**Task 3.1.3:** Print a list of the databases available on `client`.

    • What's an iterator?
    • List the databases of a server using PyMongo.
    • Print output using pprint.
    [8]:
    pp.pprint(list(client.list_databases()))
    [ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
      {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
      {'empty': False, 'name': 'config', 'sizeOnDisk': 110592},
      {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
      {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    
    [9]:
    VimeoVideo("665412216", h="7d4027dc33", width=600)
    [9]:
**Task 3.1.4:** Assign the `"air-quality"` database to the variable `db`.

    • What's a MongoDB database?
    • Access a database using PyMongo.
    [10]:
    db = client["air-quality"]
    [11]:
    VimeoVideo("665412231", h="89c546b00f", width=600)
    [11]:
**Task 3.1.5:** Use the [`list_collections`](https://pymongo.readthedocs.io/en/stable/api/pymongo/database.html?highlight=list_collections#pymongo.database.Database.list_collections) method to print a list of the collections available in `db`.

    • What's a MongoDB collection?
    • List the collections in a database using PyMongo.
    [12]:
for c in db.list_collections():
    print(c["name"])
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    [13]:
    VimeoVideo("665412252", h="bff2abbdc0", width=600)
    [13]:
**Task 3.1.6:** Assign the `"nairobi"` collection in `db` to the variable name `nairobi`.

    • Access a collection in a database using PyMongo.
    [14]:
    nairobi = db["nairobi"]
    [15]:
    VimeoVideo("665412270", h="e4a5f5c84b", width=600)
    [15]:
**Task 3.1.7:** Use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to see how many documents are in the `nairobi` collection.

    • What's a MongoDB document?
    • Count the documents in a collection using PyMongo.
    [16]:
    nairobi.count_documents({})
    [16]:
    202212
    [17]:
    VimeoVideo("665412279", h="c2315f3be1", width=600)
    [17]:
**Task 3.1.8:** Use the [`find_one`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find_one) method to retrieve one document from the `nairobi` collection, and assign it to the variable name `result`.

    • What's metadata?
    • What's semi-structured data?
    • Retrieve a document from a collection using PyMongo.
    [18]:
result = nairobi.find_one({})
pp.pprint(result)
    { 'P1': 39.67,
      '_id': ObjectId('6334b0e98c51459f9b198d27'),
      'metadata': { 'lat': -1.3,
                    'lon': 36.785,
                    'measurement': 'P1',
                    'sensor_id': 57,
                    'sensor_type': 'SDS011',
                    'site': 29},
      'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
    
    [19]:
    VimeoVideo("665412306", h="e1e913dfd1", width=600)
    [19]:
**Task 3.1.9:** Use the [`distinct`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct) method to determine how many sensor sites are included in the `nairobi` collection.

    • Get a list of distinct values for a key among all documents using PyMongo.
    [20]:
    nairobi.distinct("metadata.site")
    [20]:
    [6, 29]
    [21]:
    VimeoVideo("665412322", h="4776c6d548", width=600)
    [21]:
**Task 3.1.10:** Use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to determine how many readings there are for each site in the `nairobi` collection.

• Count the documents in a collection using PyMongo.
    [22]:
    print("Documents from site 29:",nairobi.count_documents({"metadata.site":29}))
    Documents from site 6: 70360
    Documents from site 29: 131852
    
    [23]:
    VimeoVideo("665412344", h="d2354584cd", width=600)
    [23]:
**Task 3.1.11:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to determine how many readings there are for each site in the `nairobi` collection.

    • Perform aggregation calculations on documents using PyMongo.
    [24]:
            {"$group":{"_id":"$metadata.site","count":{"$count":{}}}}
    [{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
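An aggregation pipeline is just a list of stage dictionaries, so we can build and inspect one as plain Python, with no server connection needed. A sketch of the `$group` stage used above:

```python
# The pipeline passed to `aggregate` is a list of stage dicts.
# This $group stage buckets documents by site and counts each bucket.
pipeline = [
    {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
]

# With a live connection you would run: nairobi.aggregate(pipeline)
print(list(pipeline[0].keys()))  # ['$group']
```

Because stages are ordinary dicts, you can build them programmatically, for example swapping the `"_id"` field to group by `"$metadata.measurement"` instead.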
    
    [25]:
    VimeoVideo("665412372", h="565122c9cc", width=600)
    [25]:
**Task 3.1.12:** Use the [`distinct`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct) method to determine how many types of measurements have been taken in the `nairobi` collection.

    • Get a list of distinct values for a key among all documents using PyMongo.
    [26]:
    nairobi.distinct("metadata.measurement")
    [26]:
    ['P2', 'humidity', 'P1', 'temperature']
    [27]:
    VimeoVideo("665412380", h="f7f7a39bb3", width=600)
    [27]:
**Task 3.1.13:** Use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method to retrieve the PM 2.5 readings from all sites. Be sure to limit your results to 3 records only.

    • Query a collection using PyMongo.
    [28]:
result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))
    [ { 'P2': 34.43,
        '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
        'metadata': { 'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P2',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
        'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
      { 'P2': 30.53,
        '_id': ObjectId('6334b0ea8c51459f9b1a0db3'),
        'metadata': { 'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P2',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
        'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
      { 'P2': 22.8,
        '_id': ObjectId('6334b0ea8c51459f9b1a0db4'),
        'metadata': { 'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P2',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
        'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [29]:
    VimeoVideo("665412389", h="8976ea3090", width=600)
    [29]:
**Task 3.1.14:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to calculate how many readings there are for each type (`"humidity"`, `"temperature"`, `"P2"`, and `"P1"`) in site `6`.

    • Perform aggregation calculations on documents using PyMongo.
    [30]:
            {"$group":{"_id":"$metadata.measurement","count":{"$count":{}}}}
    [ {'_id': 'P2', 'count': 18169},
      {'_id': 'humidity', 'count': 17011},
      {'_id': 'P1', 'count': 18169},
      {'_id': 'temperature', 'count': 17011}]
    
    [31]:
    VimeoVideo("665412418", h="0c4b125254", width=600)
    [31]:
    **Task 3.1.15:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to calculate how many readings there are for each type (`"humidity"`, `"temperature"`, `"P2"`, and `"P1"`) in site `29`.

    • Perform aggregation calculations on documents using PyMongo.
    [32]:
            {"$group":{"_id":"$metadata.measurement","count":{"$count":{}}}}
    [ {'_id': 'P2', 'count': 32907},
      {'_id': 'humidity', 'count': 33019},
      {'_id': 'P1', 'count': 32907},
      {'_id': 'temperature', 'count': 33019}]
    
    ## Import

    [33]:
    VimeoVideo("665412437", h="7a436c7e7e", width=600)
    [33]:
    **Task 3.1.16:** Use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method to retrieve the PM 2.5 readings from site `29`. Be sure to limit your results to 3 records only. Since we won't need the metadata for our model, use the `projection` argument to limit the results to the `"P2"` and `"timestamp"` keys only.

    • Query a collection using PyMongo.
    [34]:
        {"metadata.site":29,"metadata.measurement":"P2"},
    [35]:
    VimeoVideo("665412442", h="494636d1ea", width=600)
    [35]:
    **Task 3.1.17:** Read records from your `result` into the DataFrame `df`. Be sure to set the index to `"timestamp"`.

    • Create a DataFrame from a dictionary using pandas.
    [36]:
    df = pd.DataFrame(result).set_index("timestamp")
    [36]:
    P2
    timestamp
    2018-09-01 00:00:02.472 34.43
    2018-09-01 00:05:03.941 30.53
    2018-09-01 00:10:04.374 22.80
    2018-09-01 00:15:04.245 13.30
    2018-09-01 00:20:04.869 16.57
    [37]:
    assert isinstance(df.index, pd.DatetimeIndex), "`df` should have a `DatetimeIndex`."
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.



    <font size="+3"><strong>Databases: PyMongo</strong></font>

    Databases: PyMongo

    # Working with PyMongo

    For all of these examples, we're going to be working with the "lagos" collection in the "air-quality" database. Before we can do anything else, we need to bring in pandas (which we won't use until the very end), pprint (a module that lets us see the data in an understandable way), and PyMongo (a library for working with MongoDB databases).

    [1]:
    import datetime

    import pandas as pd
    from pprint import PrettyPrinter
    from pymongo import MongoClient

    pp = PrettyPrinter(indent=2)
    ## Databases

    Data comes to us in lots of different ways, and one of those ways is in a database. A database is an organized collection of data that can be stored, queried, and updated.

    ## Servers and Clients

    Next, we need to connect to a server. A database server is where the database resides; it can be accessed using a client. Without a client, a database is just a collection of information that we can't work with, because we have no way in. We're going to be learning more about a database system called MongoDB, and we'll use PrettyPrinter to make the information it returns easier to read. Here's how the connection works:

    [2]:
    client = MongoClient(host="localhost", port=27017)
    ## Semi-structured Data

    Databases are designed to work with either structured data or semi-structured data. In this part of the course, we're going to be working with databases that contain semi-structured data. Data is semi-structured when it has some kind of organizing logic, but that logic can't be displayed using rows and columns. Your email account contains semi-structured data if it’s divided into sections like Inbox, Sent, and Trash. If you’ve ever seen tweets from Twitter grouped by hashtag, that’s semi-structured data too. Semi-structured data is also used in sensor readings, which is what we'll be working with here.
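To make the contrast concrete, here's a small sketch (plain Python, with hypothetical values modeled on the sensor readings we'll see below) of the same reading stored as a flat row versus as a semi-structured document:

```python
# A flat, tabular row: every value sits in a fixed column position.
row = ["2018-01-07 07:07:03", 27.0, 6.501, 3.367, "DHT11", 4]

# The same reading as a semi-structured document: nested keys carry the
# organizing logic, and different documents may add or omit fields.
doc = {
    "temperature": 27.0,
    "timestamp": "2018-01-07 07:07:03",
    "metadata": {
        "lat": 6.501,
        "lon": 3.367,
        "sensor_type": "DHT11",
        "site": 4,
    },
}

# The nesting can't be shown as rows and columns without flattening it.
print(doc["metadata"]["sensor_type"])  # DHT11
```

The values here are illustrative; the real documents we'll query below follow this same nested shape.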

    ## Exploring a Database

    So, now that we're connected to a server, let's take a look at what's there. Working our way down the specificity scale, the first thing we need to do is figure out which databases are on this server. To see which databases are on the server, we'll use the list_databases method, like this:

    [3]:
    pp.pprint(list(client.list_databases()))
    [ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
      {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
      {'empty': False, 'name': 'config', 'sizeOnDisk': 98304},
      {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
      {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    
    It looks like this server contains five databases: `"admin"`, `"air-quality"`, `"config"`, `"local"`, and `"wqu-abtest"`. We're only interested in `"air-quality"`, so let's connect to that one:

    [4]:
    db = client["air-quality"]
    In MongoDB, a **database** is a container for **collections**. Each database gets its own set of files, and a single MongoDB **server** typically has multiple databases.

    ## Collections

    Let's use a for loop to take a look at the collections in the "air-quality" database:

    [5]:
    for c in db.list_collections():
        print(c["name"])
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    As you can see, there are three actual collections here: `"nairobi"`, `"lagos"`, and `"dar-es-salaam"`. Since we're only interested in the `"lagos"` collection, let's get it on its own like this:

    [6]:
    lagos = db["lagos"]
    ## Documents

    A MongoDB **document** is an individual record of data in a **collection**, and is the basic unit of analysis in MongoDB. Documents come with **metadata** that helps us understand what the document is; we'll get back to that in a minute. In the meantime, let's use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to see how many documents the `"lagos"` collection contains:

    [7]:
    lagos.count_documents({})
    [7]:
    166496
    <font size="+1">Practice</font>

    Practice

    Try it yourself! Bring in all the necessary libraries and modules, then connect to the "air-quality" database and print the number of documents in the "nairobi" collection.

    [8]:
    client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    pp.pprint(list(client.list_databases()))
    for c in db.list_collections():
        print(c["name"])
    nairobi = db["nairobi"]
    nairobi.count_documents({})
    [    {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
         {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
         {'empty': False, 'name': 'config', 'sizeOnDisk': 98304},
         {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
         {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    [8]:
    202212
    ### Retrieving Data

    Now that we know how many documents the "lagos" collection contains, let's take a closer look at what's there. The first thing you'll notice is that the output starts with a curly bracket ({) and ends with a curly bracket (}), which tells us that this information is a dictionary. To access documents in the collection, we'll use two methods: find and find_one. As you might expect, find retrieves all the documents that match a query, while find_one brings back only the first match. For now, let's stick to find_one; we'll come back to find later.

    Just like everywhere else, we'll need to assign a variable name to whatever comes back, so let's call this one result.

    [9]:
    result = lagos.find_one({})
    pp.pprint(result)
    {    '_id': ObjectId('6334b0f18c51459f9b1d955d'),
         'metadata': {    'lat': 6.501,
                          'lon': 3.367,
                          'measurement': 'temperature',
                          'sensor_id': 10,
                          'sensor_type': 'DHT11',
                          'site': 4},
         'temperature': nan,
         'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)}
    
    ### Key-Value Pairs

    There's a lot going on here! Let's work from the bottom up, starting with this:

    {
        'temperature': nan,
        'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)
    }
    

    The actual data is labeled temperature and timestamp, and if seeing it presented this way seems familiar, that's because what you're seeing are two key-value pairs, just like in a Python dictionary. In PyMongo, "_id" is always the primary key: the field whose values uniquely identify each document in a collection. We'll talk about that more in a minute.

    ### Metadata

    Next, we have this:

    'metadata': { 'lat': 6.501,
                  'lon': 3.367,
                  'measurement': 'temperature',
                  'sensor_id': 10,
                  'sensor_type': 'DHT11',
                  'site': 4}
    

    This is the document's metadata. Metadata is data about the data. If you’re working with a database, its data is the information it contains, and its metadata describes what that information is. In MongoDB, each document often has metadata of its own. If we go back to the example of your email account, each message in your Sent folder includes both the message itself and information about when you sent it and who you sent it to; the message is data, and the other information is metadata.

    The metadata we see in this block of code tells us what the key-value pairs from the last code block mean, and where the information stored there comes from. There's location data, a line telling us what kind of measurement the key-value pairs hold, some information about the equipment used to gather the data, and a label for the site the data came from.
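A quick sketch of how that nesting behaves in plain Python, using the field values from the output above: the dotted paths that PyMongo queries use, like `"metadata.site"`, correspond to chained dictionary lookups.

```python
# The document shape from the output above, as a plain dictionary.
doc = {
    "metadata": {
        "lat": 6.501,
        "lon": 3.367,
        "measurement": "temperature",
        "sensor_id": 10,
        "sensor_type": "DHT11",
        "site": 4,
    }
}

# PyMongo's "metadata.site" dot notation is the query-side analogue
# of indexing into the nested dictionary step by step:
site = doc["metadata"]["site"]
measurement = doc["metadata"]["measurement"]
print(site, measurement)  # 4 temperature
```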

    ### Identifiers

    Finally, at the top, we have this:

    {
        '_id': ObjectId('6334b0f18c51459f9b1d955d')
    }
    

    This is the document's unique identifier, which is similar to the index label for each row in a pandas DataFrame.

    <font size="+1">Practice</font>

    Practice

    Try it yourself! Retrieve a single document from the "nairobi" collection, and print the result.

    [10]:
    result = nairobi.find_one({})
    pp.pprint(result)
    {    'P1': 39.67,
         '_id': ObjectId('6334b0e98c51459f9b198d27'),
         'metadata': {    'lat': -1.3,
                          'lon': 36.785,
                          'measurement': 'P1',
                          'sensor_id': 57,
                          'sensor_type': 'SDS011',
                          'site': 29},
         'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
    
    ## Analyzing Data

    Now that we've seen what a document looks like in this collection, let's start working with what we've got. Since our metadata includes information about each sensor's "site", we might be curious to know how many sites are in the "lagos" collection. To do that, we'll use the distinct method, like this:

    [11]:
    lagos.distinct("metadata.site")
    [11]:
    [3, 4]
    Notice that in order to grab the `"site"` number, we needed to include the `"metadata"` tag.

    This tells us that there are 2 sensor sites in Lagos: one labeled 3 and the other labeled 4.

    Let's go further. We know that there are two sensor sites in Lagos, but we don't know how many documents are associated with each site. To find that out, we'll use the count_documents method for each site.

    [12]:
    print("Documents from site 3:", lagos.count_documents({"metadata.site": 3}))
    Documents from site 3: 140586
    Documents from site 4: 25910
    
    <font size="+1">Practice</font>

    Practice

    Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, and how many documents are associated with each one.

    [13]:
    print("Documents from site 29:", nairobi.count_documents({"metadata.site":29}))
    Documents from site 29: 131852
    Documents from site 6: 70360
    
    Now that we know how many *documents* are associated with each site, let's keep drilling down and find the number of *readings* for each site. We'll do this with the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method.

    Before we run it, let's take a look at some of what's happening in the code here. First, you'll notice that there are several dollar signs ($) in the list. In an aggregation pipeline, a $ prefix marks an operator (like $group and $count), while a $ in front of a field name (like "$metadata.site") refers to that field's value in each document. Here, we're saying that we want to group the documents by the value of metadata.site, and then count how many documents fall into each group.

    There's also a new field: "_id". In PyMongo, "_id" is always the primary key: the field whose values uniquely identify each document. In the output of a $group stage, "_id" holds the value that each group was built around.
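To see what `$group` is doing, here's a plain-Python emulation of the same counting logic on a few hypothetical in-memory documents (no server involved):

```python
# Hypothetical in-memory stand-ins for documents in the collection.
docs = [
    {"metadata": {"site": 3}},
    {"metadata": {"site": 3}},
    {"metadata": {"site": 4}},
]

# What {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
# does on the server: bucket documents by the value of metadata.site,
# then count the documents in each bucket.
counts = {}
for doc in docs:
    key = doc["metadata"]["site"]
    counts[key] = counts.get(key, 0) + 1

result = [{"_id": site, "count": n} for site, n in sorted(counts.items())]
print(result)  # [{'_id': 3, 'count': 2}, {'_id': 4, 'count': 1}]
```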

    Let's run the code and see what happens:

    [15]:
        [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
    [{'_id': 4, 'count': 25910}, {'_id': 3, 'count': 140586}]
    
    With that information in mind, we might want to know what those readings actually are. Since we're really interested in measures of air quality, let's take a look at the `P2` values in the `"lagos"` collection. `P2` measures the amount of particulate matter in the air, which in this case is something called PM 2.5. If we wanted to get all the documents in a collection, we could, but that would result in an unmanageably large number of records clogging up the memory on our machines. Instead, let's use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method and use `limit` to make sure we only get back the first 3.

    [16]:
    result = lagos.find({"metadata.measurement": "P2"}).limit(3)
    pp.pprint(list(result))
    [    {    'P2': 14.42,
              '_id': ObjectId('6334b0f28c51459f9b1de145'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
         {    'P2': 19.66,
              '_id': ObjectId('6334b0f28c51459f9b1de146'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
         {    'P2': 24.79,
              '_id': ObjectId('6334b0f28c51459f9b1de147'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
    
    <font size="+1">Practice</font>

    Practice

    Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, how many documents are associated with each one, and the number of observations from each site. Then, return the first three documents with the measurement `P2`.

    [29]:
        [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
    [{'_id': 29, 'count': 131852}, {'_id': 6, 'count': 70360}]
    [    {    'P2': 14.42,
              '_id': ObjectId('6334b0f28c51459f9b1de145'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
         {    'P2': 19.66,
              '_id': ObjectId('6334b0f28c51459f9b1de146'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
         {    'P2': 24.79,
              '_id': ObjectId('6334b0f28c51459f9b1de147'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
    
    So far, we've been dealing with relatively small subsets of the data in our collections, but what if we need to work with something bigger? Let's start by using `distinct` to remind ourselves of the kinds of data we have at our disposal.

    [18]:
    lagos.distinct("metadata.measurement")
    [18]:
    ['temperature', 'P1', 'humidity', 'P2']
    There are also comparison query operators that can be helpful for filtering the data. In total, we have:

    • $gt: greater than (>)
    • $lt: less than (<)
    • $gte: greater than or equal to (>=)
    • $lte: less than or equal to (<=)
    • $eq: equal to (==)
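These operators can also be combined under a single key to express a range. Here's a sketch of the filter's shape; it's just plain dictionaries, so nothing is sent to a server:

```python
import datetime

# Select timestamps in September 2018: at or after the 1st ($gte)
# and strictly before October 1st ($lt). Both operators share one key.
query = {
    "timestamp": {
        "$gte": datetime.datetime(2018, 9, 1),
        "$lt": datetime.datetime(2018, 10, 1),
    }
}

# The filter could then be passed to a query, e.g. (assuming the
# `nairobi` collection from above): nairobi.find(query).limit(3)
print(sorted(query["timestamp"]))  # ['$gte', '$lt']
```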

    Let's use the timestamp to see how we can use these operators to select different documents:

    [19]:
    result = nairobi.find({"timestamp": {"$gt": datetime.datetime(2018, 9, 1)}}).limit(3)
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P1': 39.13,
              '_id': ObjectId('6334b0e98c51459f9b198d28'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
         {    'P1': 30.07,
              '_id': ObjectId('6334b0e98c51459f9b198d29'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [20]:
    result = nairobi.find({"timestamp": {"$lt": datetime.datetime(2018, 12, 1)}}).limit(3)
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P1': 39.13,
              '_id': ObjectId('6334b0e98c51459f9b198d28'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
         {    'P1': 30.07,
              '_id': ObjectId('6334b0e98c51459f9b198d29'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [21]:
        {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P2': 34.43,
              '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P2',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    
    <font size="+1">Practice</font>

    Practice

    Try it yourself! Find three documents with a timestamp greater than or equal to, and then less than or equal to, the date December 1, 2018 (datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)).

    [33]:
    result = nairobi.find({"timestamp":{"$gte":datetime.datetime(2018,12,1,0,0,6,767000)}}).limit(3)
    [    {    'P1': 17.08,
              '_id': ObjectId('6334b0e98c51459f9b19eba8'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)},
         {    'P1': 17.62,
              '_id': ObjectId('6334b0e98c51459f9b19eba9'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 5, 6, 327000)},
         {    'P1': 11.05,
              '_id': ObjectId('6334b0e98c51459f9b19ebaa'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 10, 5, 579000)}]
    
    [ ]:
    # Less than or equal to
    ## Updating Documents

    We can also update documents by passing a filter and new values, using `update_one` to update one record or `update_many` to update many records. Let's look at an example. Before updating, the record looks like this:
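Before running an update on real data, it helps to see the two arguments side by side. Here's a minimal sketch of the shapes `update_many` expects (the values are hypothetical, and the call itself is only shown in a comment):

```python
# First argument: a filter document selecting which records to touch.
filter_doc = {"metadata.sensor_type": {"$eq": "SDS011"}}

# Second argument: an update document. "$set" assigns new values to
# the named fields without replacing the rest of each document.
update_doc = {"$set": {"metadata.sensor_type": "SDS101"}}

# A call would then look like (not executed here):
#     result = nairobi.update_many(filter_doc, update_doc)
#     result.raw_result  # reports how many documents matched/changed
print(list(update_doc))  # ['$set']
```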

    [34]:
        {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    
    Now let's update the sensor type from `"SDS011"` to `"SDS101"`. We first select all records with sensor type equal to `"SDS011"`, then set the new value to `"SDS101"`:

    [35]:
        {"metadata.sensor_type": {"$eq": "SDS101"}},
    Now we can check those records again:

    [36]:
        {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P2': 34.43,
              '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P2',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    
    We can change it back:

    [37]:
        {"$set": {"metadata.sensor_type": "SDS101"}},
    [38]:
    result.raw_result
    [38]:
    {'n': 0, 'nModified': 0, 'ok': 1.0, 'updatedExisting': False}
    ## Aggregation

    Since we're looking for *big* numbers, we need to figure out which one of those dimensions has the largest number of measurements by **aggregating** the data in each document. Since we already know that `site 3` has significantly more documents than `site 4`, let's start by looking at `site 3`. We can use the `$match` syntax to select only `site 3` data. The code to do that looks like this:

    [39]:
            {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    [    {'_id': 'temperature', 'count': 35147},
         {'_id': 'P1', 'count': 35146},
         {'_id': 'P2', 'count': 35146},
         {'_id': 'humidity', 'count': 35147}]
    
    <font size="+1">Practice</font>

    Practice

    Try it yourself! Find the number of each measurement type at site 29 in Nairobi.

    [ ]:
    pp.pprint(list(result))
    After aggregation, there is another useful operator called `$project`, which allows you to specify which fields to display, either by adding new fields or deleting existing ones. Using the Nairobi data from site 29, we can first count each sensor type:

    [42]:
            {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
    [{'_id': 'DHT22', 'count': 66038}, {'_id': 'SDS011', 'count': 65814}]
    
    We can see there are two sensor types and the corresponding counts. If we only want to display the types but don't care about the counts, we can suppress the `count` field by setting it to 0 in `$project`:

    [43]:
            {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
    [{'_id': 'DHT22'}, {'_id': 'SDS011'}]
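A sketch of that pipeline, again with the collection object name assumed to be `collection`: the `$project` stage drops `count`, leaving only `_id` (the group key) in the output.

```python
# Count documents per sensor type at site 29, then hide the count
# in the final output.
pipeline = [
    {"$match": {"metadata.site": 29}},
    {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
    {"$project": {"count": 0}},  # 0 means "exclude this field"
]

# With a live MongoDB connection:
# result = collection.aggregate(pipeline)
# pp.pprint(list(result))
```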
    
The `$project` syntax is also useful for deleting the intermediate fields that we used to generate our final fields but no longer need. In the following example, let's calculate the date difference for each sensor type. We'll first use the aggregation method to get the start date and last date.

    [44]:
                    "date_min": {"$min": "$timestamp"},
    [    {    '_id': 'DHT22',
              'date_max': datetime.datetime(2018, 12, 31, 23, 57, 37, 257000),
              'date_min': datetime.datetime(2018, 9, 1, 0, 0, 4, 301000)},
         {    '_id': 'SDS011',
              'date_max': datetime.datetime(2018, 12, 31, 23, 55, 4, 811000),
              'date_min': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    
Then we can calculate the date difference using `$dateDiff`, which computes the difference between a specified start date and end date in a given unit for timestamp data. We can see from the results above that the dates are very close to each other. The only differences are in the minutes, so we can specify the unit as `minute` to show the difference. Since we don't need the start and end dates in the output, we can define a `"dateDiff"` field inside `$project` so that only it is shown in the final display:

    [45]:
                    "date_min": {"$min": "$timestamp"},
    [{'_id': 'DHT22', 'dateDiff': 175677}, {'_id': 'SDS011', 'dateDiff': 175675}]
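We can sanity-check the `$dateDiff` result for the DHT22 sensor in plain Python. `$dateDiff` with unit `minute` counts minute boundaries crossed, which is the same as differencing the two timestamps after truncating them to the minute; this reproduces the 175677 figure above.

```python
from datetime import datetime

# Start and end dates taken from the aggregation output above
date_min = datetime(2018, 9, 1, 0, 0, 4, 301000)
date_max = datetime(2018, 12, 31, 23, 57, 37, 257000)

# Truncate both timestamps to the minute, then count elapsed minutes
start = date_min.replace(second=0, microsecond=0)
end = date_max.replace(second=0, microsecond=0)
minutes = int((end - start).total_seconds() // 60)
print(minutes)  # → 175677
```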
    
If we specify the unit as `day`, it will show the difference in days:

    [ ]:
                    "date_min": {"$min": "$timestamp"},
<font size="+1">Practice</font>

Try it yourself! Find the date difference for each measurement type at site 29 in Nairobi.

    [ ]:
    pp.pprint(list(result))
We can do more with the date data using `$dateTrunc`, which truncates datetime data. We need to specify the datetime data, which can be a `Date`, a `Timestamp`, or an `ObjectID`. Then we need to specify the `unit` (year, month, day, hour, minute, second) and `binSize` (numerical variable defining the size of the truncation). Let's check the example below, where we group data by the month using `$dateTrunc` and then count how many observations there are for each month.
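A sketch of such a monthly-count pipeline, as a plain list of stage dictionaries; the `$match` site and the collection object name follow the earlier examples and are assumptions:

```python
# Group observations into calendar-month bins and count them.
pipeline = [
    {"$match": {"metadata.site": 29}},
    {
        "$group": {
            "_id": {
                "$dateTrunc": {
                    "date": "$timestamp",  # the datetime field to truncate
                    "unit": "month",
                    "binSize": 1,          # one-month bins
                }
            },
            "count": {"$count": {}},
        }
    },
]

# With a live MongoDB connection:
# result = collection.aggregate(pipeline)
# pp.pprint(list(result))
```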

    [ ]:
                                "date": "$timestamp",
<font size="+1">Practice</font>

    Try it yourself! Truncate date by week and count at site 29 in Nairobi.

    [ ]:
    pp.pprint(list(result))
## Finishing Up

    So far, we've connected to a server, accessed that server with a client, found the collection we were looking for within a database, and explored that collection in all sorts of different ways. Now it's time to get the data we'll actually need to build a model, and store that in a way we'll be able to use.

Let's use `find` to retrieve the PM 2.5 data from site 3. And, since we don't need any of the metadata to build our model, let's strip that out using the `projection` argument. In this case, we're telling the collection that we only want to see `"timestamp"` and `"P2"`. Keep in mind that we limited the number of records we'll get back to 3 when we defined `result` above.

    [40]:
        # `projection` limits the kinds of data we'll get back.
    {'P2': 0.01, 'timestamp': datetime.datetime(2018, 10, 1, 0, 0, 52, 906000)}
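The query and projection for that `find` call can be sketched as plain dictionaries. The field names follow the structure used elsewhere in this project; the collection object name (`collection`) and the `.limit(3)` call are assumptions based on the description above.

```python
# Query: PM 2.5 readings ("P2") from site 3
query = {"metadata.site": 3, "metadata.measurement": "P2"}

# `projection` limits the kinds of data we'll get back:
# keep "P2" and "timestamp", drop the automatic "_id" field
projection = {"P2": 1, "timestamp": 1, "_id": 0}

# With a live MongoDB connection:
# result = collection.find(query, projection=projection).limit(3)
```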
    
Finally, we'll use pandas to read the extracted data into a DataFrame, making sure to set `timestamp` as the index:

    [41]:
    df = pd.DataFrame(result).set_index("timestamp")
    [41]:
    P2
    timestamp
    2018-10-01 00:04:20.554 0.01
    2018-10-01 00:07:47.504 0.01
    2018-10-01 00:11:14.382 0.01
    2018-10-01 00:14:41.239 0.01
    2018-10-01 00:18:05.938 0.01
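Reading a list of `{"timestamp": ..., "P2": ...}` records into an indexed DataFrame works the same way outside MongoDB. A self-contained sketch using two records shaped like the output above:

```python
import datetime

import pandas as pd

# Records shaped like the documents returned by `find` with our projection
records = [
    {"P2": 0.01, "timestamp": datetime.datetime(2018, 10, 1, 0, 4, 20, 554000)},
    {"P2": 0.01, "timestamp": datetime.datetime(2018, 10, 1, 0, 7, 47, 504000)},
]

df = pd.DataFrame(records).set_index("timestamp")
print(df.shape)  # → (2, 1)
```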
<font size="+1">Practice</font>

    Try it yourself! Retrieve the PM 2.5 data from site 29 in Nairobi and strip out the metadata to create a DataFrame that shows only timestamp and P2. Print the result.

    [ ]:
    result = ...
# References & Further Reading

    • Further reading about servers and clients
    • Definitions from the MongoDB documentation
    • Information on Iterators
• MongoDB documentation on Aggregation
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>3.4. ARMA Models</strong></font>

    [1]:
    import inspect
    import time
    import warnings
    ​
    import matplotlib.pyplot as plt
    import pandas as pd
    import plotly.express as px
    import seaborn as sns
    from IPython.display import VimeoVideo
    from pymongo import MongoClient
    from sklearn.metrics import mean_absolute_error
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.arima.model import ARIMA
    ​
    warnings.filterwarnings("ignore")
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    
    [2]:
     
    VimeoVideo("665851728", h="95c59d2805", width=600)
    [2]:
# Prepare Data

## Import

**Task 3.4.1:** Create a client to connect to the MongoDB server, then assign the `"air-quality"` database to `db`, and the `"nairobi"` collection to `nairobi`.

    • Create a client object for a MongoDB instance.
    • Access a database using PyMongo.
    • Access a collection in a database using PyMongo.
    [3]:
     
client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    nairobi = db["nairobi"]
    [4]:
     
def wrangle(collection, resample_rule="1H"):
    ​
        results = collection.find(
            {"metadata.site": 29, "metadata.measurement": "P2"},
            projection={"P2": 1, "timestamp": 1, "_id": 0},
        )
    ​
        # Read results into DataFrame
        df = pd.DataFrame(list(results)).set_index("timestamp")
    ​
        # Localize timezone
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    ​
        # Remove outliers
        df = df[df["P2"] < 500]
    ​
        # Resample and forward-fill
        y = df["P2"].resample(resample_rule).mean().fillna(method="ffill")
    ​
        return y
    [5]:
     
    VimeoVideo("665851670", h="3efc0c20d4", width=600)
    [5]:
**Task 3.4.2:** Change your `wrangle` function so that it has a `resample_rule` argument that allows the user to change the resampling interval. The argument default should be `"1H"`.

    • What's an argument?
    • Include an argument in a function in Python.
    [6]:
     
    # Check your work
    func_params = set(inspect.signature(wrangle).parameters.keys())
    assert func_params == set(
        ["collection", "resample_rule"]
    ), f"Your function should take two arguments: `'collection'`, `'resample_rule'`. Your function takes the following arguments: {func_params}"
**Task 3.4.3:** Use your `wrangle` function to read the data from the `nairobi` collection into the Series `y`.

    [7]:
     
y = wrangle(nairobi, resample_rule="1H")
    y.head()
    [7]:
    timestamp
    2018-09-01 03:00:00+03:00    17.541667
    2018-09-01 04:00:00+03:00    15.800000
    2018-09-01 05:00:00+03:00    11.420000
    2018-09-01 06:00:00+03:00    11.614167
    2018-09-01 07:00:00+03:00    17.665000
    Freq: H, Name: P2, dtype: float64
    [8]:
     
    # Check your work
    assert isinstance(y, pd.Series), f"`y` should be a Series, not a {type(y)}."
    assert len(y) == 2928, f"`y` should have 2,928 observations, not {len(y)}."
    assert (
        y.isnull().sum() == 0
    ), f"There should be no null values in `y`. Your `y` has {y.isnull().sum()} null values."
## Explore

    [9]:
     
    VimeoVideo("665851654", h="687ff8d5ee", width=600)
    [9]:
**Task 3.4.4:** Create an ACF plot for the data in `y`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`.

    • What's an ACF plot?
    • Create an ACF plot using statsmodels
    [10]:
     
fig, ax = plt.subplots(figsize=(15, 6))
plot_acf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
[10]:
Text(0, 0.5, 'Correlation Coefficient')
    [11]:
     
    VimeoVideo("665851644", h="e857f05bfb", width=600)
    [11]:
**Task 3.4.5:** Create a PACF plot for the data in `y`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`.

    • What's a PACF plot?
• Create a PACF plot using statsmodels
    [12]:
     
fig, ax = plt.subplots(figsize=(15, 6))
plot_pacf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
[12]:
Text(0, 0.5, 'Correlation Coefficient')
## Split

**Task 3.4.6:** Create a training set `y_train` that contains only readings from October 2018, and a test set `y_test` that contains readings from November 1, 2018.

    • Subset a DataFrame by selecting one or more rows in pandas.
    [13]:
     
# Partial-string indexing on the DatetimeIndex selects whole periods
y_train = y.loc["2018-10"]
y_test = y.loc["2018-11-01"]
    [14]:
     
    print(len(y_test))
    24
    
    [15]:
     
    # Check your work
    assert (
        len(y_train) == 744
    ), f"`y_train` should have 744 observations, not {len(y_train)}."
    assert len(y_test) == 24, f"`y_test` should have 24 observations, not {len(y_test)}."
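Partial-string indexing can be checked on a synthetic hourly series covering the same date range; the variable names here are purely illustrative. October has 31 × 24 = 744 hourly readings, and a single day has 24, which matches the assertions above.

```python
import pandas as pd

# Hourly index spanning September through November 2018
index = pd.date_range("2018-09-01", "2018-11-30 23:00", freq="h")
s = pd.Series(range(len(index)), index=index)

october = s.loc["2018-10"]       # all of October 2018
first_nov = s.loc["2018-11-01"]  # just November 1

print(len(october), len(first_nov))  # → 744 24
```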
# Build Model

## Baseline

**Task 3.4.7:** Calculate the baseline mean absolute error for your model.

    [16]:
     
    y_train_mean = y_train.mean()
    y_pred_baseline = [y_train_mean] * len(y_train)
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    print("Mean P2 Reading:", round(y_train_mean, 2))
    print("Baseline MAE:", round(mae_baseline, 2))
    Mean P2 Reading: 10.12
    Baseline MAE: 4.17
    
## Iterate

    [17]:
     
    VimeoVideo("665851576", h="36e2dc6269", width=600)
    [17]:
**Task 3.4.8:** Create ranges for possible $p$ and $q$ values. `p_params` should range between 0 and 25, by steps of 8. `q_params` should range between 0 and 3 by steps of 1.

    • What's a hyperparameter?
    • What's an iterator?
    • Create a range in Python.
    [18]:
     
p_params = range(0, 25, 8)
q_params = range(0, 3, 1)
    list(p_params)
    [18]:
    [0, 8, 16, 24]
    [19]:
     
    VimeoVideo("665851476", h="d60346ed30", width=600)
    [19]:
**Task 3.4.9:** Complete the code below to train a model with every combination of hyperparameters in `p_params` and `q_params`. Every time the model is trained, the mean absolute error is calculated and then saved to a dictionary. If you're not sure where to start, do the code-along with Nicholas!

    • What's an ARMA model?
    • Append an item to a list in Python.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    • Instantiate a predictor in statsmodels.
    • Train a model in statsmodels.
    • Write a for loop in Python.
    [20]:
     
    # Create dictionary to store MAEs
    mae_grid = dict()
    # Outer loop: Iterate through possible values for `p`
    for p in p_params:
        # Create key-value pair in dict. Key is `p`, value is empty list.
        mae_grid[p] = list()
        # Inner loop: Iterate through possible values for `q`
        for q in q_params:
            # Combination of hyperparameters for model
            order = (p, 0, q)
            # Note start time
            start_time = time.time()
            # Train model
            model = ARIMA(y_train, order=order).fit()
            # Calculate model training time
            elapsed_time = round(time.time() - start_time, 2)
            print(f"Trained ARIMA {order} in {elapsed_time} seconds.")
            # Generate in-sample (training) predictions
            y_pred = model.predict()
            # Calculate training MAE
            mae = mean_absolute_error(y_train,y_pred)
            # Append MAE to list in dictionary
            mae_grid[p].append(mae)
    ​
    print()
    print(mae_grid)
    Trained ARIMA (0, 0, 0) in 0.55 seconds.
    Trained ARIMA (0, 0, 1) in 0.35 seconds.
    Trained ARIMA (0, 0, 2) in 1.1 seconds.
    Trained ARIMA (8, 0, 0) in 10.3 seconds.
    Trained ARIMA (8, 0, 1) in 42.8 seconds.
    Trained ARIMA (8, 0, 2) in 80.41 seconds.
    Trained ARIMA (16, 0, 0) in 42.9 seconds.
    Trained ARIMA (16, 0, 1) in 159.28 seconds.
    Trained ARIMA (16, 0, 2) in 220.81 seconds.
    Trained ARIMA (24, 0, 0) in 122.58 seconds.
    Trained ARIMA (24, 0, 1) in 158.0 seconds.
    Trained ARIMA (24, 0, 2) in 288.89 seconds.
    
    {0: [4.171460443827197, 3.3506427433555537, 3.1057222587888647], 8: [2.9384480570404223, 2.9149008203151663, 2.908046094677133], 16: [2.9201084726122, 2.929436207635203, 2.913638894139686], 24: [2.914390327277196, 2.9136013252035013, 2.89792511429827]}
    
    [21]:
     
    VimeoVideo("665851464", h="12f4080d0b", width=600)
    [21]:
**Task 3.4.10:** Organize all the MAEs from above in a DataFrame named `mae_df`. Each row represents a possible value for $q$ and each column represents a possible value for $p$.

    • Create a DataFrame from a dictionary using pandas.
    [22]:
     
mae_df = pd.DataFrame(mae_grid)
mae_df.round(4)
    [22]:
    0 8 16 24
    0 4.1715 2.9384 2.9201 2.9144
    1 3.3506 2.9149 2.9294 2.9136
    2 3.1057 2.9080 2.9136 2.8979
    [23]:
     
    VimeoVideo("665851453", h="dfd415bc08", width=600)
    [23]:
**Task 3.4.11:** Create a heatmap of the values in `mae_grid`. Be sure to label your x-axis `"p values"` and your y-axis `"q values"`.

    • Create a heatmap in seaborn.
    [24]:
     
sns.heatmap(mae_df, cmap="Blues")
plt.xlabel("p values")
plt.ylabel("q values")
    [24]:
    Text(33.0, 0.5, 'q values')
    [25]:
     
    VimeoVideo("665851444", h="8b58161f26", width=600)
    [25]:
**Task 3.4.12:** Use the [`plot_diagnostics`](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMAResults.plot_diagnostics.html) method to check the residuals for your model. Keep in mind that the plot will represent the residuals from the last model you trained, so make sure it was your best model, too!

    • Examine time series model residuals using statsmodels.
    [26]:
     
    fig, ax = plt.subplots(figsize=(15, 12))
    model.plot_diagnostics(fig=fig)
    ​
    [26]:
## Evaluate

    [27]:
     
    VimeoVideo("665851439", h="c48d80cdf4", width=600)
    [27]:
**Task 3.4.13:** Complete the code below to perform walk-forward validation for your model for the entire test set `y_test`. Store your model's predictions in the Series `y_pred_wfv`. Choose the values for $p$ and $q$ that best balance model performance and computation time. Remember: This model is going to have to train 24 times before you can see your test MAE!

    [29]:
     
y_pred_wfv = pd.Series(dtype="float64")
history = y_train.copy()
for i in range(len(y_test)):
    # Train on all observations seen so far, then forecast one step ahead
    model = ARIMA(history, order=(8, 0, 1)).fit()
    next_pred = model.forecast()
    y_pred_wfv = pd.concat([y_pred_wfv, next_pred])
    # Fold the true observation back into the training data
    history = pd.concat([history, y_test[next_pred.index]])
    y_pred_wfv
    [29]:
    2018-11-01 00:00:00+03:00     8.725735
    2018-11-01 01:00:00+03:00     7.988511
    2018-11-01 02:00:00+03:00     7.133804
    2018-11-01 03:00:00+03:00     6.372284
    2018-11-01 04:00:00+03:00     7.539867
    2018-11-01 05:00:00+03:00     8.538572
    2018-11-01 06:00:00+03:00    10.757824
    2018-11-01 07:00:00+03:00    10.611658
    2018-11-01 08:00:00+03:00     9.914120
    2018-11-01 09:00:00+03:00    10.911411
    2018-11-01 10:00:00+03:00     9.094045
    2018-11-01 11:00:00+03:00     7.250349
    2018-11-01 12:00:00+03:00     6.515896
    2018-11-01 13:00:00+03:00     7.184177
    2018-11-01 14:00:00+03:00     6.864486
    2018-11-01 15:00:00+03:00     7.564168
    2018-11-01 16:00:00+03:00     8.732164
    2018-11-01 17:00:00+03:00     8.295855
    2018-11-01 18:00:00+03:00     8.199158
    2018-11-01 19:00:00+03:00     8.293687
    2018-11-01 20:00:00+03:00     9.204907
    2018-11-01 21:00:00+03:00     9.629578
    2018-11-01 22:00:00+03:00     9.658772
    2018-11-01 23:00:00+03:00    10.351504
    Freq: H, dtype: float64
    [30]:
     
test_mae = mean_absolute_error(y_test, y_pred_wfv)
    print("Test MAE (walk forward validation):", round(test_mae, 2))
    Test MAE (walk forward validation): 1.67
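The walk-forward pattern itself is independent of ARIMA. A minimal sketch with a naive last-value forecaster on made-up numbers, so the retrain-predict-extend loop can be seen (and run) without the training cost:

```python
import pandas as pd

# Toy data: values are illustrative only
y_train_toy = pd.Series([10.0, 12.0, 11.0])
y_test_toy = pd.Series([13.0, 14.0, 12.0])

preds = []
history = list(y_train_toy)
for actual in y_test_toy:
    preds.append(history[-1])  # naive forecast: repeat the last observation
    history.append(actual)     # walk forward: fold the true value back in

print(preds)  # → [11.0, 13.0, 14.0]
```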
    
# Communicate Results

    [31]:
     
    VimeoVideo("665851423", h="8236ff348f", width=600)
    [31]:
**Task 3.4.14:** First, generate the list of training predictions for your model. Next, create a DataFrame `df_predictions` with the true values `y_test` and your predictions `y_pred_wfv` (don't forget the index). Finally, plot `df_predictions` using plotly express. Make sure that the y-axis is labeled `"P2"`.

    • Generate in-sample predictions for a model in statsmodels.
    • Create a DataFrame from a dictionary using pandas.
    • Create a line plot in pandas.
    [32]:
     
df_predictions = pd.DataFrame(
    {"y_test": y_test, "y_pred_wfv": y_pred_wfv}
)
fig = px.line(df_predictions, labels={"value": "P2"})
fig.show()
(interactive line plot comparing `y_test` and `y_pred_wfv` over November 1, 2018)
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

<font size="+3"><strong>Pandas: Advanced</strong></font>

# Calculate Summary Statistics for a DataFrame or Series

Many datasets are large, and it can be helpful to get a summary of information in them. Let's load a dataset and examine the first few rows:

    [ ]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1.head()
Let's get a summary description of this dataset.

    [ ]:
    mexico_city1.describe()
Like most large datasets, this one has many missing values. The `describe` method ignores missing values in each column. You can also remove rows and columns with missing values, and then get a summary of the data that's still there. We need to remove columns first, before removing the rows; the sequence of operations here is important. The code looks like this:

[ ]:
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
).dropna()
Let's take a look at our new, cleaner dataset.

    [ ]:
    mexico_city1.describe()
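The effect of that order of operations can be seen on a small synthetic frame (column names here are made up): dropping a mostly-empty column first preserves rows that `dropna` would otherwise discard.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [100.0, 200.0, 300.0],
        "rooms": [None, None, 2.0],  # mostly missing
    }
)

# Drop the sparse column first, then the rows with missing values
cols_then_rows = df.drop(["rooms"], axis=1).dropna()
# Reverse order: dropna first discards every row with a missing "rooms"
rows_then_cols = df.dropna().drop(["rooms"], axis=1)

print(len(cols_then_rows), len(rows_then_cols))  # → 3 1
```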
<font size="+1">Practice</font>

Reload the mexico-city-real-estate-1.csv dataset. Reverse the sequence of operations by first dropping all rows where there is a missing value, and then dropping the `floor`, `price_usd_per_m2`, `expenses`, and `rooms` columns. What is the size of the resulting DataFrame?

    [ ]:
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
# Select a Series from a DataFrame

Since the datasets we work with are so large, you might want to focus on a single column of a DataFrame. Let's load up the `mexico-city-real-estate-2` dataset, and examine the first few rows to find the column names.

    [ ]:
mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv")
mexico_city2.head()
Maybe we're interested in the `surface_covered_in_m2` column. The code to extract just that one column looks like this:

    [ ]:
    surface_covered_in_m2 = mexico_city2["surface_covered_in_m2"]
<font size="+1">Practice</font>

Load the mexico-city-real-estate-2 dataset into the mexico_city2 DataFrame, then select the price Series from it.

    [ ]:
    print(price)
# Subset a DataFrame by Selecting One or More Columns

You may find it more efficient to work with a smaller portion of a dataset that's relevant to you. One way to do this is to select some columns from a DataFrame and make a new DataFrame with them. Let's load a dataset to do this and examine the first few rows to find the column headings:

    [ ]:
mexico_city4 = pd.read_csv("./data/mexico-city-real-estate-4.csv")
mexico_city4.head()
Let's choose `"operation"`, `"property_type"`, `"place_with_parent_names"`, and `"price"`:

[ ]:
mexico_city4_subset = mexico_city4[
    ["operation", "property_type", "place_with_parent_names", "price"]
]
Once we've done that, we can find the resulting number of entries in the DataFrame:

    [ ]:
    mexico_city4_subset.shape
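Column subsetting with a double-bracket list can be demonstrated on a tiny synthetic frame (the column names are illustrative): passing a list of column names returns a new DataFrame containing only those columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "operation": ["sell", "rent"],
        "price": [100, 200],
        "rooms": [3, 2],
    }
)

subset = df[["operation", "price"]]  # list of columns -> new DataFrame
print(subset.shape)  # → (2, 2)
```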
<font size="+1">Practice</font>

Load the mexico-city-real-estate-1.csv dataset and subset it to obtain the operation, lat-lon, and place_with_parent_names columns only. How many entries are in the resulting DataFrame?

    [ ]:
        ["operation", "lat-lon", "place_with_parent_names"]
# Subset the Columns of a DataFrame Based on Data Types

It's helpful to be able to find specific types of entries — typically numeric ones — and put all of these in a separate DataFrame. First, let's take a look at the `mexico-city-real-estate-5` dataset.

    [ ]:
mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")
mexico_city5.head()
The code to subset just the numerical entries looks like this:

    [ ]:
    mexico_city5_numbers = mexico_city5.select_dtypes(include="number")
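`select_dtypes` works the same way on any DataFrame. A small sketch with made-up columns, using `include="number"` for the numeric subset and `exclude="number"` for everything else:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [100.0, 200.0],
        "rooms": [3, 2],
        "borough": ["Coyoacán", "Tlalpan"],
    }
)

numbers_only = df.select_dtypes(include="number")
no_numbers = df.select_dtypes(exclude="number")

print(list(numbers_only.columns), list(no_numbers.columns))
```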
<font size="+1">Practice</font>

    Create a subset of the DataFrame from mexico-city-real-estate-5 which excludes numbers.

    [ ]:
print(mexico_city5_no_numbers.shape)
# Working with `value_counts` in a Series

In order to use the data in a Series for other types of analysis, it might be helpful to know how often each value occurs in the Series. To do that, we use the `value_counts` method to aggregate the data. Let's take a look at the number of properties associated with each department in the colombia-real-estate-1 dataset.

    [ ]:
df1 = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["department"])
df1["department"].value_counts()
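On a synthetic Series (the department names below are just examples), `value_counts` returns each distinct value with its frequency, sorted in descending order:

```python
import pandas as pd

departments = pd.Series(
    ["Antioquia", "Santander", "Antioquia", "Antioquia", "Santander"]
)

counts = departments.value_counts()
print(counts["Antioquia"], counts["Santander"])  # → 3 2
```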
<font size="+1">Practice</font>

    Try it yourself! Aggregate the different property types in the colombia-real-estate-2 dataset.

    [ ]:
    df2 = ...
# Series and `Groupby`

    Large Series often include data points that have some attribute in common, but which are nevertheless not grouped together in the dataset. Happily, pandas has a method that will bring these data points together into groups.

    Let's take a look at the colombia-real-estate-1 dataset. The set includes properties scattered across Colombia, so it might be useful to group properties from the same department together; to do this, we'll use the groupby method. The code looks like this:

    [ ]:
    dept_group = df1.groupby("department")
To make sure we got all the departments in the dataset, let's print the first occurrence of each department.

    [ ]:
    dept_group.first()
<font size="+1">Practice</font>

    Try it yourself! Group the properties in colombia-real-estate-2 by department, and print the result.

    [ ]:
    dept_group.first()
Now that we have all the properties grouped by department, we might want to see the properties in just one of the departments. We can use the `get_group` method to do that. If we just wanted to see the properties in `"Santander"`, for example, the code would look like this:

    [ ]:
    dept_group1 = df1.groupby("department")
    xxxxxxxxxx
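The `first` and `get_group` calls above can be sketched on a tiny made-up DataFrame (column names follow the lesson; the values are hypothetical):

```python
import pandas as pd

# Small stand-in for the Colombia real-estate data
df = pd.DataFrame({
    "department": ["Antioquia", "Santander", "Antioquia", "Santander"],
    "area_m2": [120.0, 90.0, 200.0, 150.0],
})

dept_group = df.groupby("department")

# first() returns the first row observed for each group
firsts = dept_group.first()

# get_group() pulls out all the rows belonging to one group
santander = dept_group.get_group("Santander")
print(santander)
```

Note that `get_group` returns the original rows, grouping column included, so you can keep working with them as a normal DataFrame.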
    We can also make groups based on more than one category by adding them to the `groupby` method. After resetting the `df1` DataFrame, here's what the code looks like if we want to group properties both by `department` and by `property_type`.

    [ ]:
    dept_group2 = df1.groupby(["department", "property_type"])
    <font size="+1">Practice</font>

    Try it yourself! Group the properties in colombia-real-estate-2 by department and property type, and print the result.

    [ ]:
    ...
    Finally, it's possible to use `groupby` to calculate aggregations. For example, if we wanted to find the average property area in each department, we would use the `.mean()` method. This is what the code for that looks like:

    [ ]:
    dept_group = df1.groupby("department")["area_m2"].mean()
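Here's the same aggregation pattern on a tiny hypothetical DataFrame, so you can see the shape of the result without the lesson's CSV:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Antioquia", "Santander", "Antioquia"],
    "area_m2": [100.0, 80.0, 140.0],
})

# Selecting a column from the groupby and chaining .mean()
# yields one average per department, as a Series
mean_area = df.groupby("department")["area_m2"].mean()
print(mean_area)
```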
    <font size="+1">Practice</font>

    Try it yourself! Use the `.mean()` method on the colombia-real-estate-2 dataset to find the average price in Colombian pesos ("price_cop") for properties in each "department".

    [ ]:
    dept_group = ...
    # Subset a DataFrame's Columns Based on the Column Data Types

    It's helpful to be able to find entries of a certain type, typically numerical entries, and put all of these in a separate DataFrame. Let's load a dataset to see how this works:

    [ ]:
    mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")
    Now let's get only numerical entries:

    [ ]:
    mexico_city5_numbers = mexico_city5.select_dtypes(include="number")
    Let's now find all the entries that are not numerical:

    [ ]:
    mexico_city5_no_numbers = mexico_city5.select_dtypes(exclude="number")
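The `select_dtypes` split can be sketched on a made-up DataFrame (hypothetical columns and values):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [100, 200],          # integer column
    "area": [50.5, 80.0],         # float column
    "department": ["Antioquia", "Tolima"],  # object (string) column
})

# include="number" keeps int and float columns;
# exclude="number" keeps everything else
numeric = df.select_dtypes(include="number")
non_numeric = df.select_dtypes(exclude="number")
```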
    <font size="+1">Practice</font>

    Create a subset of the DataFrame from mexico-city-real-estate-5.csv which excludes numbers. How many entries does it have?

    [ ]:
    print(mexico_city5_no_numbers.shape)
    # Subset a DataFrame's columns based on column names

    To subset a DataFrame by column names, either define a list of columns to include or define a list of columns to exclude. Once you've done that, you can retain or drop the columns accordingly. For example, let's suppose we want to modify the `mexico_city3` dataset and only retain the first three columns. Let's define two lists, one with the columns to retain and one with the columns to drop:

    [ ]:
    keep_cols = ["operation", "property_type", "place_with_parent_names"]
    drop_cols = [col for col in mexico_city3.columns if col not in keep_cols]
    Next, we'll explore both approaches to subset `mexico_city3`. To retain columns based on `keep_cols`:

    [ ]:
    mexico_city3_subsetted = mexico_city3[keep_cols]
    To drop columns in `drop_cols`:

    [ ]:
    mexico_city3_subsetted = mexico_city3.drop(columns=drop_cols)
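Both approaches produce the same subset, which a small hypothetical DataFrame makes easy to verify:

```python
import pandas as pd

df = pd.DataFrame({
    "operation": ["sell", "rent"],
    "property_type": ["house", "apartment"],
    "price": [100, 50],
})

keep_cols = ["operation", "property_type"]
drop_cols = ["price"]

kept = df[keep_cols]                 # retain listed columns
dropped = df.drop(columns=drop_cols)  # drop listed columns
```

Use whichever list is shorter to write; the results are identical.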
    <font size="+1">Practice</font>

    Create a subset of the DataFrame from mexico-city-real-estate-3.csv which excludes the last two columns.

    [ ]:
    ## Pivot Tables

    A pivot table allows us to aggregate and summarize a DataFrame across multiple variables. For example, let's suppose we wanted to calculate the mean of the price column in the mexico_city3 dataset for the different values in the property_type column:

    [ ]:
    mexico_city3_pivot = pd.pivot_table(
        mexico_city3, values="price", index="property_type", aggfunc="mean"
    )
    # Subsetting with Masks

    Another way to create subsets from a larger dataset is through masking. Masks are ways to filter out the data you're not interested in so that you can focus on the data you are. For example, we might want to look at properties in Colombia that are bigger than 200 square meters. In order to create this subset, we'll need to use a mask.

    First, we'll reset our df1 DataFrame so that we can draw on all the data in its original form. Then we'll write a comparison statement and assign the result to `mask`.

    [ ]:
    df1 = pd.read_csv("data/colombia-real-estate-1.csv")
    mask = df1["area_m2"] > 200
    Notice that `mask` is a Series of Boolean values. Where properties are smaller than 200 square meters, our statement evaluates to `False`; where they're bigger than 200, it evaluates to `True`.

    Once we have our mask, we can use it to select all the rows from df1 that evaluate to True.

    [ ]:
    df1[mask].head()
    <font size="+1">Practice</font>

    Try it yourself! Read colombia-real-estate-2 into a DataFrame named df2, and create a mask to select all the properties that are smaller than 100 square meters.

    [ ]:
    df2 = ...
    mask = ...
    df2[mask].head()
    We can also create masks with multiple criteria using `&` for "and" and `|` for "or." For example, here's how we would find all properties in Atlántico that are bigger than 400 square meters.

    [ ]:
    mask = (df1["area_m2"] > 400) & (df1["department"] == "Atlántico")
    df1[mask].head()
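Here is the two-criteria mask on a tiny made-up DataFrame, so the Boolean logic is visible row by row (values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "area_m2": [150.0, 450.0, 500.0],
    "department": ["Atlántico", "Atlántico", "Tolima"],
})

# The parentheses are required: & binds more tightly than
# the comparison operators it combines
mask = (df["area_m2"] > 400) & (df["department"] == "Atlántico")
subset = df[mask]
print(subset)
```

Only the middle row satisfies both conditions, so the subset contains exactly one property.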
    <font size="+1">Practice</font>

    Try it yourself! Create a mask for df2 to select all the properties in Tolima that are smaller than 150 square meters.

    [ ]:
    mask = ...
    df2[mask].head()
    ## Reshape a DataFrame based on column values

    ## What's a pivot table?

    A pivot table allows you to quickly aggregate and summarize a DataFrame using an aggregation function. For example, to build a pivot table that summarizes the mean of the price_cop column for each of the unique categories in the property_type column in df2:

    [ ]:
    pivot1 = pd.pivot_table(df2, values="price_cop", index="property_type", aggfunc=np.mean)
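A self-contained sketch of the same pivot, on made-up prices (the string `"mean"` is used for `aggfunc`, which newer pandas versions prefer over passing `np.mean` directly):

```python
import pandas as pd

df = pd.DataFrame({
    "property_type": ["house", "apartment", "house"],
    "price_cop": [300.0, 100.0, 500.0],
})

# One row per unique property_type, holding the mean price
pivot = pd.pivot_table(df, values="price_cop", index="property_type", aggfunc="mean")
print(pivot)
```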
    <font size="+1">Practice</font>

    Build a pivot table that summarizes the mean of the price_cop column for each of the unique departments in the department column in df2:

    [ ]:
    pivot2 = pd.pivot_table(df2, values="price_cop", index="department", aggfunc=np.mean)
    ## Combine multiple categories in a Series

    Categorical variables can be collapsed into a smaller number of categories. One approach is to retain the most frequently observed values and collapse all remaining values into a single category. For example, to retain only the top 10 most frequent categories in the `department` column and then collapse the other categories together, use `value_counts` to generate the count of the values:

    [ ]:
    df2["department"].value_counts()
    Next, select just the top 10 using the `head()` method, and select the category names using the `index` attribute of the Series:

    [ ]:
    top_10 = df2["department"].value_counts().head(10).index
    Finally, use the `apply` method and a lambda function to keep the top 10 values in the `department` column and collapse the remaining values into the value `"Other"`:

    [ ]:
    df2["department"] = df2["department"].apply(lambda x: x if x in top_10 else "Other")
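The whole collapse pattern fits in a few lines on a made-up Series (keeping the top 2 instead of the top 10, to keep the example small):

```python
import pandas as pd

s = pd.Series(["A", "A", "A", "B", "B", "C", "D"])

# Keep the 2 most frequent categories, collapse the rest into "Other"
top_2 = s.value_counts().head(2).index
collapsed = s.apply(lambda x: x if x in top_2 else "Other")
print(collapsed.value_counts())
```

`"C"` and `"D"` each occur once, so both end up in `"Other"`.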
    <font size="+1">Practice</font>

    Try it yourself! Retain the top 5 most frequent values in the department column and collapse the remaining values into the category Other.

    [ ]:
    # Cross Tabulation

    The pandas `crosstab` function is useful for working with grouped summary statistics for **categorical** data. It starts by picking two categorical columns, then defines one as the index and the other as the column. If the aggregate function and value column are not defined, `crosstab` will simply calculate the frequency of each combination by default. Let's see the example below from the Colombia real estate dataset.

    [ ]:
    df = pd.read_csv("data/colombia-real-estate-1.csv")
    The following call calculates the frequency of each combination for two variables, `department` and `property_type`, in the data set.

    [ ]:
    pd.crosstab(index=df["department"], columns=df["property_type"])
    From the previous example, you can see we have created a DataFrame where the index shows the unique values of the variable `department`, while the columns are the unique values of the variable `property_type`. Each cell shows how many data points there are for each combination of `department` and `property_type`. For example, there are 8 `apartments` in department `Antioquia`.
    We can further specify a value column and an aggregate function, like in `pivot_table`, to conduct more complicated calculations for the two categorical variables. In the following example, we're looking at the average area size for different property types in the departments in Colombia.

    [ ]:
    pd.crosstab(
        index=df["department"],
        columns=df["property_type"],
        values=df["area_m2"],
        aggfunc="mean",
    )
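Both `crosstab` modes — plain frequency and value-plus-aggregate — can be sketched on a tiny hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Antioquia", "Antioquia", "Tolima"],
    "property_type": ["apartment", "apartment", "house"],
    "area_m2": [60.0, 80.0, 120.0],
})

# Without values/aggfunc, crosstab counts each combination
freq = pd.crosstab(index=df["department"], columns=df["property_type"])

# With values and aggfunc, it behaves like a pivot table
mean_area = pd.crosstab(
    index=df["department"],
    columns=df["property_type"],
    values=df["area_m2"],
    aggfunc="mean",
)
```

Combinations that never occur (here, a house in Antioquia) show up as 0 in the frequency table and as NaN in the aggregated one.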
    <font size="+1">Practice</font>

    Create a cross tabulation calculating the frequency of combinations from mexico-city-real-estate-3.csv using currency as the index and property_type as the column.

    [ ]:
    # Create `crosstab`
    # Applying Functions to DataFrames and Series

    `apply` is a useful method for applying one function to all the rows or columns of a DataFrame efficiently. Let's take the following real estate dataset as an example:

    [ ]:
    df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])
    By specifying the function inside `apply()`, we can transform the whole DataFrame. For example, here we calculate the square root of every value in every column:

    [ ]:
    import numpy as np

    df.apply(np.sqrt)
    Note you can also specify the `axis` argument inside `apply`. By default, we have `axis=0`, which means we are applying the function to each column. We can also switch to `axis=1` if we want to apply the function to each row. See the following example showing the sum of all rows for each column:

    [ ]:
    df.apply(np.sum, axis=0)
    The following code will get the sum of all columns for each row:

    [ ]:
    df.apply(np.sum, axis=1)
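The two `axis` directions are easy to confuse; a 2×2 made-up DataFrame shows the difference at a glance:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

col_sums = df.apply(np.sum, axis=0)  # one sum per column: a=3, b=30
row_sums = df.apply(np.sum, axis=1)  # one sum per row: 11, 22
```

A useful mnemonic: `axis=0` collapses the rows (producing one value per column), `axis=1` collapses the columns (producing one value per row).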
    By specifying the column name, we can also apply the function to a specific column or columns. Note that we can also specify the index (row labels) to apply functions only to specific rows; however, this is not common in practice.

    [ ]:
    df["area_m2"].apply(np.sqrt)
    We can assign the results to a new column:

    [ ]:
    df["area_sqrt"] = df["area_m2"].apply(np.sqrt)
    <font size="+1">Practice</font>

    Try it yourself! Create a new column named 'sum_columns', which is the sum of all numerical columns in df:

    [ ]:
    df["sum_columns"] = ...
    `df.aggregate()`, or `df.agg()`, shares the same concept as `df.apply()` in terms of applying functions to a DataFrame, but `df.aggregate()` can only apply aggregate functions like `sum`, `mean`, `min`, `max`, etc. See the following example for more details:

    [ ]:
    df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])
    We can check the minimum value for each column by calling the `min` aggregate function:

    [ ]:
    df.agg("min")
    Like `apply()`, we can also specify the `axis` argument to switch axes:

    [ ]:
    df.agg("min", axis=1)
    We can apply an aggregate function to a whole DataFrame using `df.agg()`, or specify the column name for a subset of the DataFrame:

    [ ]:
    df["area_m2"].agg("min")
    We can also apply a list of aggregate functions to a DataFrame:

    [ ]:
    df.agg(["sum", "max"])
    The syntax above will calculate both `sum` and `max` for each column, storing the function names as the index of the result. We can also apply different aggregate functions to different columns. In this case, we pass a dictionary whose keys are column names and whose values are the corresponding aggregate function names:

    [ ]:
    df.agg({"area_m2": "sum", "price_cop": "min"})
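The per-column dictionary form is worth seeing end to end on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"area_m2": [50.0, 70.0], "price_cop": [200.0, 100.0]})

# Keys are column names, values are the aggregate to apply to that column
result = df.agg({"area_m2": "sum", "price_cop": "min"})
print(result)
```

The result is a Series: the total area and the minimum price, one entry per column.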
    <font size="+1">Practice</font>

    Try it yourself! Find the minimum for column area_m2 and the maximum for column price_cop using df.agg():

    [ ]:
    # Working with Dates

    ## Time Stamps

    Pandas' time series capabilities are built on the `Timestamp` class. For instance, we can transform date strings of different formats into `Timestamp`:

    [ ]:
    print(pd.Timestamp("January 8, 2022"))
    We can also do date math with `Timestamp`s. Note that when we subtract two dates, we get a special object called a `Timedelta`.

    [ ]:
    time_delta = pd.Timestamp("Feb. 11 2016 2:30 am") - pd.Timestamp("2015-08-03 5:14 pm")
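Running that subtraction stand-alone shows what a `Timedelta` looks like — it carries both a day count and a time-of-day component:

```python
import pandas as pd

# Subtracting one Timestamp from another yields a Timedelta
delta = pd.Timestamp("Feb. 11 2016 2:30 am") - pd.Timestamp("2015-08-03 5:14 pm")
print(delta)  # 191 days 09:16:00
```

The `.days` attribute gives the whole-day part, while `.components` breaks out hours, minutes, and seconds.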
    The pandas `Timestamp` class is also compatible with Python's `datetime` module.

    [ ]:
    import datetime

    pd.Timestamp("Feb 11, 2016") - datetime.datetime(2015, 8, 3)
    We can manipulate dates using the offset classes in `pandas.tseries.offsets`. We can use the `Day()` class to add or subtract days from a `Timestamp`. The following code calculates the date that's four days prior to 9 January 2017:

    [ ]:
    from pandas.tseries.offsets import BDay, BMonthEnd, Day

    print(pd.Timestamp("January 9, 2017") - Day(4))
    In some cases, you might need to add or subtract only business days. That's when you use `BDay()`:

    [ ]:
    print(pd.Timestamp("January 9, 2017") - BDay(4))
    We can also do an offset at the monthly level. For example, you can use `BMonthEnd(n)` to find the last business day $n$ business month-ends ahead. The following code shows the last business day of January 2017:

    [ ]:
    print(pd.Timestamp("January 9, 2017") + BMonthEnd(0))
    The following code shows the last business day of May 2017, which is the fourth business month-end after 9 February 2017:

    [ ]:
    print(pd.Timestamp("February 9, 2017") + BMonthEnd(4))
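The three offsets above behave differently around weekends and month boundaries, which a quick side-by-side makes concrete (9 January 2017 is a Monday):

```python
import pandas as pd
from pandas.tseries.offsets import BDay, BMonthEnd, Day

# Four calendar days back from Monday, Jan 9
d1 = pd.Timestamp("January 9, 2017") - Day(4)     # Thursday, Jan 5

# Four *business* days back skips the weekend
d2 = pd.Timestamp("January 9, 2017") - BDay(4)    # Tuesday, Jan 3

# BMonthEnd(0) rolls forward to this month's last business day
d3 = pd.Timestamp("January 9, 2017") + BMonthEnd(0)  # Tuesday, Jan 31
```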
    We can also use the `strptime` function from the `datetime` module to transform a string into a date:

    [ ]:
    from datetime import datetime

    date = datetime.strptime("16 Oct 2022 12:14:05", "%d %b %Y %H:%M:%S")
    We can further transform it into the [ISO 8601 format](https://en.wikipedia.org/wiki/ISO_8601):

    [ ]:
    date.isoformat()
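Putting the two steps together, parsing with `strptime` and re-serializing with `isoformat`:

```python
from datetime import datetime

# %d day, %b abbreviated month name, %Y four-digit year, then H:M:S
date = datetime.strptime("16 Oct 2022 12:14:05", "%d %b %Y %H:%M:%S")

# ISO 8601 uses "T" between the date and time parts
iso = date.isoformat()
print(iso)  # 2022-10-16T12:14:05
```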
    ## Date Time Indices

    `DatetimeIndex` is a convenient class that transforms date-like data into a readable `Datetime` format for a DataFrame index. That way, we can better plot and model time series data. Let's check an example. Here we have Apple's stock prices from 1999 to 2022.

    [ ]:
    data.set_index("date", inplace=True)
    Even though the index looks like *dates*, it is not in *date format*. So when we plot the data, the index doesn't follow the sequence of years, and the x-axis ticks are hard to read.

    [ ]:
    data["open"].plot()
    You can see the index is not in date format.

    [ ]:
    data.index[:5]
    We can use the `DatetimeIndex` class to transform it into dates:

    [ ]:
    pd.DatetimeIndex(data.index)
    And we can set it as the index:

    [ ]:
    data.index = pd.DatetimeIndex(data.index)
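The stock-price CSV only exists on the course VM, but the conversion itself can be sketched on a two-row stand-in DataFrame:

```python
import pandas as pd

# String dates as the index — looks like dates, but isn't date-aware
df = pd.DataFrame({"open": [1.0, 2.0]}, index=["1999-11-01", "1999-11-02"])

# Convert the string index into a proper DatetimeIndex
df.index = pd.DatetimeIndex(df.index)
```

After the conversion, the index supports date arithmetic, resampling, and sensible axis ticks when plotting.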
    Now we can see the benefit when plotting:

    [ ]:
    data["open"].plot()
    ## Date Ranges

    If we're entering time series data into a DataFrame, it will often be useful to create a range of dates using `date_range`. We can create it with different frequencies by specifying `freq`. Here are the days in a specific range:

    [ ]:
    pd.date_range(start="1/8/2022", end="3/2/2022", freq="D")
    Here are the business days for a specific range:

    [ ]:
    pd.date_range(start="1/8/2022", end="3/2/2022", freq="B")
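On a short window the difference between `"D"` and `"B"` is easy to count by hand (8 January 2022 is a Saturday):

```python
import pandas as pd

# Calendar days: Jan 8, 9, 10, 11, 12 — five dates
days = pd.date_range(start="2022-01-08", end="2022-01-12", freq="D")

# Business days only: Jan 10 (Mon), 11, 12 — three dates
bdays = pd.date_range(start="2022-01-08", end="2022-01-12", freq="B")
```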
    # References & Further Reading

    • Pandas Documentation on Dropping a Column in a DataFrame
    • Pandas Documentation on Printing out the First Few Rows of a DataFrame
    • Pandas Documentation on Reading in a CSV File Into a DataFrame
    • Getting Started with Pandas Documentation
    • Pandas Documentation on Extracting a Column to a Series
    • Pandas Official Indexing Guide
    • Series in pandas
    • Tutorial for value_counts
    • Tutorial for groupby
    • pandas Documentation for groupby
    • Online Examples of Selecting Numeric Columns of a DataFrame
    • Official Pandas Documentation on Data Types in a DataFrame
    • Pandas documentation for Boolean indexing
    • More information on creating various kinds of subsets
    • More on boolean indexing
    • A tutorial on using various criteria to create subsets
    • Pandas.DataFrame.apply
    • Pandas.DataFrame.aggregate
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

    <font size="+3"><strong>3.5. Air Quality in Dar es Salaam 🇹🇿</strong></font>

    [1]:
    import warnings

    warnings.simplefilter(action="ignore", category=FutureWarning)
    [21]:
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    [3]:
    from pprint import PrettyPrinter

    pp = PrettyPrinter(indent=2)
    # Prepare Data

    ## Connect
    **Task 3.5.1:** Connect to the MongoDB server running at host `"localhost"` on port `27017`. Then connect to the `"air-quality"` database and assign the collection for Dar es Salaam to the variable name `dar`.

    [4]:
    client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    dar = db["dar-es-salaam"]
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam

    [23]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.1", [dar.name])

    You got it. Dance party time! 🕺💃🕺💃

    Score: 1
    ## Explore
    **Task 3.5.2:** Determine the numbers assigned to all the sensor sites in the Dar es Salaam collection. Your submission should be a list of integers.

    [6]:
    sites = dar.distinct("metadata.site")
    sites
    [6]:
    [23, 11]
    [25]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.2", sites)

    Party time! 🎉🎉🎉

    Score: 1
    **Task 3.5.3:** Determine which site in the Dar es Salaam collection has the most sensor readings (of any type, not just PM2.5 readings). Your submission `readings_per_site` should be a list of dictionaries that follows this format:

    [{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]

    Note that the values here ☝️ are from the Nairobi collection, so your values will look different.

    [7]:
    result = dar.aggregate(
        [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
    )
    readings_per_site = list(result)
    readings_per_site
    [7]:
    [{'_id': 23, 'count': 60020}, {'_id': 11, 'count': 138412}]
    [28]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.3", readings_per_site)

    Wow, you're making great progress.

    Score: 1
    ## Import
    **Task 3.5.4:** (5 points) Create a `wrangle` function that will extract the PM2.5 readings from the site that has the most total readings in the Dar es Salaam collection. Your function should do the following steps:

    1. Localize reading time stamps to the timezone for "Africa/Dar_es_Salaam".
    2. Remove all outlier PM2.5 readings that are above 100.
    3. Resample the data to provide the mean PM2.5 reading for each hour.
    4. Impute any missing values using the forward-fill method.
    5. Return a Series y.
    [8]:
    result = dar.find_one({})
    pp.pprint(result)
    { 'P1': 24.3,
      '_id': ObjectId('6334b0ef8c51459f9b1bffb7'),
      'metadata': { 'lat': -6.818,
                    'lon': 39.285,
                    'measurement': 'P1',
                    'sensor_id': 29,
                    'sensor_type': 'SDS011',
                    'site': 11},
      'timestamp': datetime.datetime(2018, 1, 1, 0, 0, 4, 53000)}

    [9]:
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")
    Use your `wrangle` function to query the `dar` collection and return your cleaned results.

    [10]:
    y = wrangle(dar)
    y.head()
    [10]:
    timestamp
    2018-01-01 03:00:00+03:00    9.456327
    2018-01-01 04:00:00+03:00    9.400833
    2018-01-01 05:00:00+03:00    9.331458
    2018-01-01 06:00:00+03:00    9.528776
    2018-01-01 07:00:00+03:00    8.861250
    Freq: H, Name: P2, dtype: float64
    [38]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.4", wrangle(dar))

    Yup. You got it.

    Score: 1
    ## Explore Some More
    **Task 3.5.5:** Create a time series plot of the readings in `y`. Label your x-axis `"Date"` and your y-axis `"PM2.5 Level"`. Use the title `"Dar es Salaam PM2.5 Levels"`.

    [11]:
    fig, ax = plt.subplots()
    y.plot(xlabel="Date", ylabel="PM2.5 Level", title="Dar es Salaam PM2.5 Levels", ax=ax)
    [40]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.5", file)

    Very impressive.

    Score: 1
    **Task 3.5.6:** Plot the rolling average of the readings in `y`. Use a window size of `168` (the number of hours in a week). Label your x-axis `"Date"` and your y-axis `"PM2.5 Level"`. Use the title `"Dar es Salaam PM2.5 Levels, 7-Day Rolling Average"`.

    Task 3.5.6: Plot the rolling average of the readings in y. Use a window size of 168 (the number of hours in a week). Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels, 7-Day Rolling Average".

    [12]:
fig, ax = plt.subplots()  # create the axes the plot is drawn on
y.rolling(168).mean().plot(xlabel="Date", ylabel="PM2.5 Level", title="Dar es Salaam PM2.5 Levels, 7-Day Rolling Average", ax=ax);
    [42]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.6", file)

    Awesome work.

    Score: 1

**Task 3.5.7:** Create an ACF plot for the data in `y`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`. Use the title `"Dar es Salaam PM2.5 Readings, ACF"`.

    [13]:
plt.title("Dar es Salaam PM2.5 Readings, ACF")
    [46]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.7", file)

    Boom! You got it.

    Score: 1

**Task 3.5.8:** Create a PACF plot for the data in `y`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`. Use the title `"Dar es Salaam PM2.5 Readings, PACF"`.

    [14]:
plt.title("Dar es Salaam PM2.5 Readings, PACF")
    [48]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.8", file)

    Boom! You got it.

    Score: 1

## Split

**Task 3.5.9:** Split `y` into training and test sets. The first 90% of the data should be in your training set. The remaining 10% should be in the test set.

    [15]:
    print("y_train shape:", y_train.shape)
    y_train shape: (1533,)
    y_test shape: (171,)
    
    [50]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.9a", y_train)

    Yes! Keep on rockin'. 🎸That's right.

    Score: 1

    [52]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.9b", y_test)

    🥳

    Score: 1
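A split along these lines can be sketched with a stand-in Series (not the real PM2.5 data); the key point is that time series are split chronologically, never shuffled:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hourly PM2.5 Series (not the real data).
y = pd.Series(np.arange(100, dtype=float))

# The first 90% of observations train the model; the last 10% test it.
cutoff = int(len(y) * 0.90)
y_train = y.iloc[:cutoff]
y_test = y.iloc[cutoff:]

print(y_train.shape, y_test.shape)  # (90,) (10,)
```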

# Build Model

## Baseline

**Task 3.5.10:** Establish the baseline mean absolute error for your model.

    [16]:
    mae_baseline = mean_absolute_error(y_train,y_pred_baseline)
    Mean P2 Reading: 8.617582545265428
    Baseline MAE: 4.07658759405218
    
    [54]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.10", [mae_baseline])

    Correct.

    Score: 1

## Iterate

**Task 3.5.11:** You're going to use an AR model to predict PM2.5 readings, but which hyperparameter settings will give you the best performance? Use a `for` loop to train your AR model using settings for `p` from 1 to 30. Each time you train a new model, calculate its mean absolute error and append the result to the list `maes`. Then store your results in the Series `mae_series`.

    [26]:
            mae = mean_absolute_error(y_train.iloc[p:],y_pred)
    [26]:
    1    0.947888
    2    0.933894
    3    0.920850
    4    0.920153
    5    0.919519
    Name: mae, dtype: float64
    [27]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.11", mae_series)

    Awesome work.

    Score: 1

**Task 3.5.12:** Look through the results in `mae_series` and determine what value for `p` provides the best performance. Then build and train `best_model` using the best hyperparameter value.

**Note:** Make sure that you build and train your model in one line of code, and that the data type of `best_model` is `statsmodels.tsa.ar_model.AutoRegResultsWrapper`.

    [28]:
best_model = AutoReg(y_train, lags=best_p).fit()
    [29]:
        "Project 3 Assessment", "Task 3.5.12", [isinstance(best_model.model, AutoReg)]

    Python master 😁

    Score: 1

**Task 3.5.13:** Calculate the training residuals for `best_model` and assign the result to `y_train_resid`. **Note** that the name of your Series should be `"residuals"`.

    [30]:
    y_train_resid.name = "residuals"
    [30]:
    timestamp
    2018-01-02 09:00:00+03:00   -0.560158
    2018-01-02 10:00:00+03:00   -2.193105
    2018-01-02 11:00:00+03:00   -0.026408
    2018-01-02 12:00:00+03:00    0.820217
    2018-01-02 13:00:00+03:00   -0.078009
    Freq: H, Name: residuals, dtype: float64
    [31]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.13", y_train_resid.tail(1500))

    Party time! 🎉🎉🎉

    Score: 1

**Task 3.5.14:** Create a histogram of `y_train_resid`. Be sure to label the x-axis as `"Residuals"` and the y-axis as `"Frequency"`. Use the title `"Best Model, Training Residuals"`.

    [33]:
    plt.title("Best Model, Training Residuals")
    [34]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.14", file)

    Boom! You got it.

    Score: 1

**Task 3.5.15:** Create an ACF plot for `y_train_resid`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`. Use the title `"Dar es Salaam, Training Residuals ACF"`.

    [35]:
    plt.title("Dar es Salaam, Training Residuals ACF")
    [36]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.15", file)

    Very impressive.

    Score: 1

## Evaluate

**Task 3.5.16:** Perform walk-forward validation for your model for the entire test set `y_test`. Store your model's predictions in the Series `y_pred_wfv`. Make sure the name of your Series is `"prediction"` and the name of your Series index is `"timestamp"`.

    [39]:
    history = pd.concat([history, y_test[next_pred.index]])  # Series.append is deprecated
    [39]:
    timestamp
    2018-03-06 00:00:00+03:00    8.049707
    2018-03-06 01:00:00+03:00    8.676565
    2018-03-06 02:00:00+03:00    6.271593
    2018-03-06 03:00:00+03:00    6.319877
    2018-03-06 04:00:00+03:00    7.163557
    Freq: H, Name: prediction, dtype: float64
    [38]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.16", y_pred_wfv)

    That's the right answer. Keep it up!

    Score: 1

**Task 3.5.17:** Submit your walk-forward validation predictions to the grader to see test mean absolute error for your model.

    [40]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    Cell In [40], line 1
    ----> 1 wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:182, in grade(assessment_id, question_id, submission)
        177 def grade(assessment_id, question_id, submission):
        178   submission_object = {
        179     'type': 'simple',
        180     'argument': [submission]
        181   }
    --> 182   return show_score(grade_submission(assessment_id, question_id, submission_object))
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:146, in grade_submission(assessment_id, question_id, submission_object)
        144     raise Exception('Grader raised error: {}'.format(error['message']))
        145   else:
    --> 146     raise Exception('Could not grade submission: {}'.format(error['message']))
        147 result = envelope['data']['result']
        149 # Used only in testing
    
    Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
# Communicate Results

**Task 3.5.18:** Put the values for `y_test` and `y_pred_wfv` into the DataFrame `df_pred_test` (don't forget the index). Then plot `df_pred_test` using plotly express. Be sure to label the x-axis as `"Date"` and the y-axis as `"PM2.5 Level"`. Use the title `"Dar es Salaam, WFV Predictions"`.

    [42]:
    fig.write_image("images/3-5-18.png", scale=1, height=500, width=700)
    [43]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    Cell In [43], line 2
          1 with open("images/3-5-18.png", "rb") as file:
    ----> 2     wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:182, in grade(assessment_id, question_id, submission)
        177 def grade(assessment_id, question_id, submission):
        178   submission_object = {
        179     'type': 'simple',
        180     'argument': [submission]
        181   }
    --> 182   return show_score(grade_submission(assessment_id, question_id, submission_object))
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:146, in grade_submission(assessment_id, question_id, submission_object)
        144     raise Exception('Grader raised error: {}'.format(error['message']))
        145   else:
    --> 146     raise Exception('Could not grade submission: {}'.format(error['message']))
        147 result = envelope['data']['result']
        149 # Used only in testing
    
    Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Time Series: Core Concepts</strong></font>

    [ ]:
    from IPython.display import YouTubeVideo
# Model Types

## Autoregression Models

Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works similarly to autocorrelation: in both cases, we're taking data from one part of a series and comparing it to another part. In other words, an AR model regresses a series on its own past values.
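Under the hood, an AR(p) fit is just a linear regression on lagged copies of the series. Here is a minimal sketch using plain NumPy least squares (an illustrative assumption; the curriculum uses the statsmodels `AutoReg` class for this), which simulates an AR(2) process and recovers its coefficients:

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model by ordinary least squares: regress y[t]
    on an intercept plus its own p previous values."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column k holds the lag-k values aligned with the targets y[p:].
    lags = np.column_stack([y[p - k : n - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(n - p), lags])
    coefs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coefs  # [intercept, phi_1, ..., phi_p]

# Simulate an AR(2) process and check that the fit recovers its coefficients.
rng = np.random.default_rng(42)
phi1, phi2 = 0.6, 0.3
y = np.zeros(5_000)
for t in range(2, len(y)):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal()

coefs = fit_ar(y, p=2)
print(coefs[1:])  # estimates should land close to (0.6, 0.3)
```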

## ARMA Models

ARMA stands for Autoregressive Moving Average, and it's a special kind of time-series model. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, so the future no longer looks much like the past. These sudden changes (economists call them exogenous shocks) can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might change if there were a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust. The moving average (MA) part of an ARMA model helps account for such shocks by modeling the current value partly as a function of recent forecast errors.
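The kind of process an ARMA model describes can be simulated directly. Below is a minimal NumPy sketch (not from the lesson; the project uses statsmodels for actual fitting) of an ARMA(1,1) process, checking its sample lag-1 autocorrelation against the theoretical value:

```python
import numpy as np

def simulate_arma11(phi, theta, n, seed=0):
    """Simulate y[t] = phi*y[t-1] + e[t] + theta*e[t-1]: the AR part
    carries the level forward, the MA part carries recent shocks."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + e[t] + theta * e[t - 1]
    return y

def acf(y, lag):
    """Sample autocorrelation at a given lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float((y[lag:] * y[:-lag]).sum() / (y * y).sum())

phi, theta = 0.5, 0.4
y = simulate_arma11(phi, theta, n=20_000)

# Theoretical lag-1 autocorrelation of an ARMA(1,1) process.
rho1 = (1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta**2)
print(round(acf(y, 1), 3), round(rho1, 3))  # the two values should be close
```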

    Below is a video from Ritvik Kharkar that explains MA models in terms of cupcakes and crazy professors — two things we love!

    [ ]:
    YouTubeVideo("voryLhxiPzE")
And in this video, Ritvik talks about the ARMA model we use in Project 3.

    [ ]:
    YouTubeVideo("HhvTlaN06AM")
# Plots

## ACF Plot

When we've worked with autocorrelations in the past, we've treated them like static relationships, but that's not always how they work. Sometimes, we want to see how those autocorrelations change as the lag between observations grows, which means we need to think of them as *functions* of the lag. When we create a visual representation of an autocorrelation function (ACF), we're making an **ACF plot**.
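For intuition, here is a small NumPy sketch (not from the lesson) that computes the values an ACF plot would show for a simulated AR(1) series. For AR(1) with coefficient `phi`, the theoretical ACF is `phi ** lag`, which is the slow geometric tail an ACF plot makes visible:

```python
import numpy as np

def acf(y, lag):
    """Sample autocorrelation at a given lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float((y[lag:] * y[:-lag]).sum() / (y * y).sum())

# Simulate an AR(1) process; its ACF decays geometrically rather than
# cutting off at some lag.
rng = np.random.default_rng(5)
phi = 0.7
y = np.zeros(20_000)
for t in range(1, len(y)):
    y[t] = phi * y[t - 1] + rng.normal()

for lag in range(1, 6):
    print(lag, round(acf(y, lag), 2), round(phi**lag, 2))
```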

## PACF Plot

Autocorrelations at a given lag mix together two kinds of relationships. The *direct* relationship is the one between an observation and the observation exactly k time steps before it. *Indirect* relationships are the ones that flow through the observations in between: the reading k hours ago influences the current reading partly because it influenced every reading since. Those intermediate effects can inflate the autocorrelation, so it's a good idea to strip them out and see what's left when we only measure the direct relationship.

An autocorrelation that removes those intermediate effects is called a partial autocorrelation, and when we view the partial autocorrelation as a function of the lag, we call it a PACF.

PACF plots represent those things visually. We want to compare our ACF and PACF plots to see which model best describes our time series. If the ACF tails off slowly while the PACF cuts off sharply after lag p, an AR(p) model is a good description; if the PACF tails off slowly while the ACF cuts off sharply after lag q, an MA(q) model is the better choice.
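One way to make "strip out the intermediate effects" concrete: the partial autocorrelation at lag k equals the coefficient on the lag-k term in an AR(k) regression. A minimal NumPy sketch (assuming a simulated AR(2) series, not the project data):

```python
import numpy as np

def pacf(y, k):
    """Partial autocorrelation at lag k: the coefficient on y[t-k] in an
    AR(k) regression, i.e. the lag-k effect after controlling for the
    intervening lags 1..k-1."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    X = np.column_stack([y[k - j : n - j] for j in range(1, k + 1)])
    coefs, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return float(coefs[-1])

# For an AR(2) process the PACF is large at lags 1-2 and near zero after,
# which is the sharp cutoff a PACF plot reveals.
rng = np.random.default_rng(1)
y = np.zeros(10_000)
for t in range(2, len(y)):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

print([round(pacf(y, k), 2) for k in range(1, 6)])
```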

# Statistical Concepts

## Walk-Forward Validation

Our predictions lose power over time because the model gets farther and farther away from its beginning. But what if we could move that beginning forward with the model? That's what walk-forward validation does. In walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past.
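The loop described above can be sketched in a few lines. This is a hypothetical illustration with simulated data and a plain least-squares AR fit standing in for the statsmodels model used in the project:

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares AR(p) fit; returns [intercept, phi_1, ..., phi_p]."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack(
        [np.ones(n - p)] + [y[p - k : n - k] for k in range(1, p + 1)]
    )
    coefs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coefs

def walk_forward(y_train, y_test, p, window):
    """One-step-ahead walk-forward validation: re-fit on the most recent
    `window` observations before each prediction, then append the true
    value so the training window rolls forward."""
    history = list(y_train)
    preds = []
    for true_value in y_test:
        coefs = fit_ar(history[-window:], p)
        recent = np.array(history[-p:])[::-1]  # most recent lag first
        preds.append(coefs[0] + recent @ coefs[1:])
        history.append(true_value)  # walk forward one step
    return np.array(preds)

rng = np.random.default_rng(7)
y = np.zeros(600)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()

preds = walk_forward(y[:500], y[500:], p=1, window=300)
mae = float(np.mean(np.abs(preds - y[500:])))
print(round(mae, 2))  # should sit near the noise level (std = 1)
```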

## Parameters

Parameters are the parts of the model that are learned from the training data. In an AR model, for example, the coefficients on the lagged observations are parameters.

## Hyperparameters

    We've already seen that parameters are elements that a machine learning model learns from the training data. Hyperparameters, on the other hand, are elements of the model that come from somewhere else. Data scientists choose hyperparameters either by examining the data themselves, or by creating some kind of automated testing of different options to see how they perform. Hyperparameters are set before the model is trained, which means that they significantly impact how the model is trained, and how it subsequently performs. One way of thinking about the difference between the two is that parameters come from inside the model, and hyperparameters come from outside the model.

When we think about hyperparameters for ARMA models, we think in terms of `p` and `q`. The value `p` is the number of lagged observations included in the model (the autoregressive order), and `q` is the size of the moving average window (the moving average order). These values count as hyperparameters because we get to decide what they are. How many lagged observations do we want to include? How big should our window of moving averages be?
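A hyperparameter search over `p` can be sketched as a loop that fits one model per candidate order and keeps the order with the lowest validation error. Simulated data and a bare least-squares AR fit are used here purely for illustration:

```python
import numpy as np

# Simulate an AR(2) process and hold out the last 500 points for validation.
rng = np.random.default_rng(3)
y = np.zeros(3_000)
for t in range(2, len(y)):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()
y_train, y_val = y[:2_500], y[2_500:]

def val_mae(p):
    """Validation MAE of a zero-mean AR(p) model fit by least squares."""
    n = len(y_train)
    X = np.column_stack([y_train[p - k : n - k] for k in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, y_train[p:], rcond=None)
    full = np.concatenate([y_train, y_val])
    # One-step predictions over the validation period, most recent lag first.
    preds = [full[t - p : t][::-1] @ phi for t in range(n, len(full))]
    return float(np.mean(np.abs(np.array(preds) - y_val)))

maes = {p: val_mae(p) for p in range(1, 6)}
best_p = min(maes, key=maes.get)
print(best_p, round(maes[best_p], 2))
```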

## Rolling Averages

A rolling average is the mean of successive overlapping subsets of a dataset. For example, I might track the daily income for a shop I own. On Friday, I can calculate the average income from Monday through Thursday; on the following Monday, the average from Tuesday through Friday; the day after that, the average from Wednesday through the following Monday; and so on. The window of the average rolls forward one day at a time, smoothing out day-to-day noise and giving me a sense of how the data is changing over time. Calculating rolling averages is helpful for spotting trends and making forecasts about how the data will change in the future.
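The shop example can be sketched directly. This trailing-window mean mirrors what `y.rolling(window).mean()` computes in pandas; the income numbers are made up:

```python
import numpy as np

def rolling_mean(x, window):
    """Trailing rolling mean: entry t is the mean of x[t-window+1 .. t].
    The first window-1 entries are NaN, mirroring pandas'
    x.rolling(window).mean()."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    sums = np.cumsum(np.insert(x, 0, 0.0))
    out[window - 1 :] = (sums[window:] - sums[:-window]) / window
    return out

# Made-up daily income for the shop example, averaged over a 4-day window.
daily_income = [100, 120, 80, 110, 90, 130, 70]
print(rolling_mean(daily_income, 4))  # NaN for the first 3 days, then 4-day means
```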

---



<font size="+3"><strong>Databases: SQL</strong></font>

    [ ]:
    from IPython.display import YouTubeVideo
# Working with SQL Databases

A database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a **relational database**. A relational database is a widely used database model that consists of a collection of uniquely named **tables** used to store information. The structure of a database model with its tables, constraints, and relationships is called a **schema**.

Structured Query Language (SQL) is used to retrieve information from a relational database. SQL is one of the most commonly used database languages. It allows data stored in a relational database to be queried, modified, and manipulated easily with basic commands. SQL powers database engines like MySQL, SQL Server, SQLite, and PostgreSQL. The examples and projects in this course will use SQLite.

    A table refers to a collection of rows and columns in a relational database. When reading data into a pandas DataFrame, an index can be defined, which acts as the label for every row in the DataFrame.

# Connecting to a Database

## ipython-sql

### Magic Commands

Jupyter notebooks can run code that is not valid Python but still affects the notebook. These special commands are called magic commands. Magic commands can have a range of properties. Some commonly used magic functions are below:

| Magic Command | Description |
|---|---|
| `%pwd` | Print the current working directory |
| `%cd` | Change the current working directory |
| `%ls` | List the contents of the current directory |
| `%history` | Show the history of the `In [ ]:` commands |

    We will be leveraging magic commands to work with a SQLite database.

### ipython-sql

`ipython-sql` allows you to write SQL code directly in a Jupyter Notebook. The `%sql` (or `%%sql`) magic command is added to the beginning of a code block and then SQL code can be written.

### Connecting with ipython-sql

We can connect to a database using the `%sql` magic function:

    [ ]:
    %sql sqlite:////home/jovyan/nepal.sqlite
## sqlite3

We can also connect to the same database using the `sqlite3` package:

    [ ]:
    conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
# Querying a Database

## Building Blocks of the Basic Query

There are seven common clauses used for querying data:

| Clause | Definition |
|---|---|
| `SELECT` | Determines which columns to include in the query's result |
| `FROM` | Identifies the table from which to query the data |
| `WHERE` | Filters rows based on conditions |
| `GROUP BY` | Groups rows by common values in columns |
| `HAVING` | Filters out unwanted groups created by `GROUP BY` |
| `ORDER BY` | Orders the rows using one or more columns |
| `LIMIT` | Outputs the specified number of rows |

    All clauses may be used together, but SELECT and FROM are the only required clauses. The format of clauses is in the example query below:

    SELECT column1, column2
    FROM table_name
    WHERE "conditions"
    GROUP BY "column-list"
    HAVING "conditions"
    ORDER BY "column-list"
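To see the clauses working together, here is a sketch that runs a query in this shape against a tiny in-memory SQLite table built with Python's `sqlite3` module. The table name and columns echo the course's `id_map` table, but the contents are invented for illustration:

```python
import sqlite3

# A hypothetical in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, district_id INTEGER)")
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3)],
)

query = """
SELECT district_id,
       count(*) AS n_households
FROM id_map
WHERE household_id > 1
GROUP BY district_id
HAVING count(*) >= 2
ORDER BY n_households DESC
LIMIT 5
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(2, 3)] -- only district 2 keeps 2+ households after filtering
```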
    
## SELECT and FROM

You can use `SELECT *` to select all columns in a table. `FROM` specifies the table in the database to query. `LIMIT 5` will select only the first five rows.

    Example

    [ ]:
    FROM id_map
You can also use `SELECT` to select certain columns in a table:

    [ ]:
    SELECT household_id,
<font size="+1">Practice</font>

    Try it yourself! Use SELECT to select the district_id column from the id_map table.

    [ ]:
    %%sql
We can also assign an **alias** or temporary name to a column using the `AS` command. Aliases can also be used on a table. See the example below, which assigns the alias `household_number` to `household_id`.

    [ ]:
    SELECT household_id AS household_number,
<font size="+1">Practice</font>

    Try it yourself! Use SELECT, FROM, AS, and LIMIT to select the first 5 rows from the id_map table. Rename the district_id column to district_number.

    [ ]:
    %%sql
## Filtering and Sorting Data

SQL provides a variety of comparison operators that can be used with the `WHERE` clause to filter the data.

| Comparison Operator | Description |
|---|---|
| `=` | Equal |
| `>` | Greater than |
| `<` | Less than |
| `>=` | Greater than or equal to |
| `<=` | Less than or equal to |
| `<>` or `!=` | Not equal to |
| `LIKE` | String comparison test |
For example, to select the first 5 homes in Ramechhap (district `2`):

    [ ]:
    %%sql
<font size="+1">Practice</font>

    Try it yourself! Use WHERE to select the row with household_id equal to 13735001

    [ ]:
    %%sql
## Aggregating Data

Aggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:

| Aggregation Function | Definition |
|---|---|
| `MIN` | Return the minimum value |
| `MAX` | Return the largest value |
| `SUM` | Return the sum of values |
| `AVG` | Return the average of values |
| `COUNT` | Return the number of observations |
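These functions can be tried out on a tiny in-memory SQLite table via Python's `sqlite3` module. The table name and columns follow the course's `id_map` table, but the rows are invented for illustration:

```python
import sqlite3

# Hypothetical data to demonstrate the aggregation functions above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, district_id INTEGER)")
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2)],
)

# One aggregate over the whole table...
total = conn.execute("SELECT count(*) FROM id_map").fetchone()[0]

# ...the same aggregate per group, and a distinct count.
per_district = conn.execute(
    "SELECT district_id, count(*) FROM id_map GROUP BY district_id"
).fetchall()
n_districts = conn.execute(
    "SELECT count(distinct district_id) FROM id_map"
).fetchone()[0]

print(total, n_districts)          # 5 2
print(sorted(per_district))        # [(1, 2), (2, 3)]
```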
Use the `COUNT` function to find the number of observations in the `id_map` table that come from Ramechhap (district `2`):

    [ ]:
    WHERE district_id = 2
Aggregation functions are frequently used with a `GROUP BY` clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each district:

    [ ]:
    GROUP BY district_id
`DISTINCT` is a keyword to select unique records in a query result. For example, if we want to know the unique values in the `district_id` column:

    [ ]:
    SELECT distinct(district_id)
<font size="+1">Practice</font>

    Try it yourself! Use DISTINCT to count the number of unique values in the vdcmun_id column.

    [ ]:
    %%sql
`DISTINCT` and `COUNT` can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the `district_id` column:

    [ ]:
    SELECT count(distinct(district_id))
<font size="+1">Practice</font>

    Try it yourself! Use DISTINCT and COUNT to count the number of unique values in the vdcmun_id column.

    [ ]:
    %%sql
# Joining Tables

Joins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where `table1` and `table2` refer to the two tables being joined, `column1` and `column2` refer to columns to be returned from both tables, and `ID` refers to the common column in the two tables.

SELECT table1.column1,
       table2.column2
FROM table1
JOIN table2 ON table1.ID = table2.ID
    
We'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding `foundation_type` for the first home in Ramechhap (district `2`). We'll start by looking at this single record in the `id_map` table.

    [ ]:
    WHERE district_id = 2
This household has `building_id` equal to 23. Let's look at the `foundation_type` for this building, by filtering the `building_structure` table to find this building.

    [ ]:
    FROM building_structure
To join the two tables and limit the results to `building_id = 23`:

    [ ]:
    JOIN building_structure ON id_map.building_id = building_structure.building_id
    xxxxxxxxxx
    In addition to the basic `JOIN` clause, specific join types can be specified, which specify whether the common column needs to be in one, both, or either of the two tables being joined. The different join types are below. The left table is the table specified first, immediately after the `FROM` clause and the right table is the table specified after the `JOIN` clause. If the generic `JOIN` clause is used, then by default the `INNER JOIN` will be used.

    | JOIN Type | Definition |
    |---|---|
    | `INNER JOIN` | Returns rows where the ID is in both tables. |
    | `LEFT JOIN` | Returns all rows from the left table; returns `NA` for the right table's columns when the ID is not in the right table. |
    | `RIGHT JOIN` | Returns all rows from the right table; returns `NA` for the left table's columns when the ID is not in the left table. |
    | `FULL JOIN` | Returns rows where the ID is in either table; returns `NA` for a table's columns when the ID is missing from that table. |
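To make the difference between `INNER JOIN` and `LEFT JOIN` concrete, here is a minimal, self-contained sketch using Python's built-in `sqlite3` module. The two toy tables loosely mimic `id_map` and `building_structure`, but every row is invented for illustration:

```python
import sqlite3

# In-memory database with two toy tables (rows invented for illustration)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE id_map (household_id INTEGER, building_id INTEGER);
    CREATE TABLE building_structure (building_id INTEGER, foundation_type TEXT);
    INSERT INTO id_map VALUES (1, 23), (2, 23), (3, 99);
    INSERT INTO building_structure VALUES (23, 'Mud mortar-Stone/Brick');
    """
)

# INNER JOIN keeps only rows whose building_id appears in BOTH tables
inner_rows = conn.execute(
    """
    SELECT id_map.household_id, building_structure.foundation_type
    FROM id_map
    JOIN building_structure ON id_map.building_id = building_structure.building_id
    """
).fetchall()

# LEFT JOIN keeps every row of the left table; unmatched rows get NULL
left_rows = conn.execute(
    """
    SELECT id_map.household_id, building_structure.foundation_type
    FROM id_map
    LEFT JOIN building_structure ON id_map.building_id = building_structure.building_id
    """
).fetchall()

print(len(inner_rows))  # 2 -- building 99 has no structure record
print(len(left_rows))   # 3 -- household 3 is kept, with a NULL foundation_type
```

Note that older versions of SQLite support only `INNER` and `LEFT` joins; `RIGHT JOIN` and `FULL JOIN` were added in SQLite 3.39.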
    The video below outlines the main types of joins:

    [ ]:
    YouTubeVideo("2HVMiPPuPIM")
    <font size="+1">Practice</font>

    Try it yourself! Use the `DISTINCT` command to create a column with all unique building IDs in the `id_map` table. `LEFT JOIN` this column with the `roof_type` column from the `building_structure` table, showing only buildings where `district_id` is 1 and limiting your results to the first five rows of the new table.

    [ ]:
    %%sql
    # Using pandas with SQL Databases
    To save the output of a query into a pandas DataFrame, we first connect to the SQLite database using the `sqlite3` package:

    [ ]:
    import sqlite3  # part of the Python standard library

    conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
    To run a query using `sqlite3`, we need to store the query as a string. For example, the variable below called `query` is a string containing a query which returns the first 10 rows from the `id_map` table:

    [ ]:
    query = """
        SELECT *
        FROM id_map
        LIMIT 10
    """
    To save the results of the query into a pandas DataFrame, use the `pd.read_sql()` function. The optional parameter `index_col` can be used to set the index to a specific column from the query. 

    [ ]:
    df = pd.read_sql(query, conn, index_col="building_id")
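As a self-contained illustration of `pd.read_sql` (using an in-memory database with invented rows rather than the real `nepal.sqlite` file):

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for nepal.sqlite (rows invented for illustration)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE id_map (building_id INTEGER, district_id INTEGER);
    INSERT INTO id_map VALUES (23, 1), (24, 1), (25, 2);
    """
)

# The query is just a string; pd.read_sql runs it and returns a DataFrame
query = "SELECT * FROM id_map LIMIT 2"
df_example = pd.read_sql(query, conn, index_col="building_id")

print(df_example.shape)       # (2, 1)
print(df_example.index.name)  # building_id
```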
    <font size="+1">Practice</font>

    Try it yourself! Use the `pd.read_sql` function to save the results of a query to a DataFrame. The query should select the first 20 rows from the `id_map` table.

    [ ]:
    query = ...
    # References & Further Reading

    • Additional Explanation of Magic Commands
    • ipython-SQL User Documentation
    • Data Carpentry Course on SQL in Python
    • SQL Course Material on GitHub (1)
    • SQL Course Material on GitHub (2)
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

    <font size="+3"><strong>3.3. Autoregressive Models</strong></font>

    [1]:
    import warnings
    ​
    import matplotlib.pyplot as plt
    import pandas as pd
    import plotly.express as px
    from IPython.display import VimeoVideo
    from pymongo import MongoClient
    from sklearn.metrics import mean_absolute_error
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.ar_model import AutoReg
    ​
    warnings.simplefilter(action="ignore", category=FutureWarning)
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    
    [2]:
    VimeoVideo("665851858", h="e39fc3d260", width=600)
    [2]:
    # Prepare Data

    ## Import

    [3]:
    VimeoVideo("665851852", h="16aa0a56e6", width=600)
    [3]:
    **Task 3.3.1:** Complete the code below to create a client that connects to the MongoDB server, assign the `"air-quality"` database to `db`, and assign the `"nairobi"` collection to `nairobi`.

    • Create a client object for a MongoDB instance.
    • Access a database using PyMongo.
    • Access a collection in a database using PyMongo.
    [4]:
    client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    ​
    nairobi = db["nairobi"]
    [5]:
    VimeoVideo("665851840", h="e048425f49", width=600)
    [5]:
    **Task 3.3.2:** Change the `wrangle` function below so that it returns a Series of the resampled data instead of a DataFrame.

    [6]:
    def wrangle(collection):
        results = collection.find(
            {"metadata.site": 29, "metadata.measurement": "P2"},
            projection={"P2": 1, "timestamp": 1, "_id": 0},
        )
    ​
        # Read data into DataFrame
        df = pd.DataFrame(list(results)).set_index("timestamp")
    ​
        # Localize timezone
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    ​
        # Remove outliers
        df = df[df["P2"] < 500]
    ​
        # Resample to 1hr window
        y = df["P2"].resample("1H").mean().fillna(method='ffill')
    ​
        return y
    **Task 3.3.3:** Use your wrangle function to read the data from the `nairobi` collection into the Series `y`.

    [7]:
    y = wrangle(nairobi)
    y.head()
    [7]:
    timestamp
    2018-09-01 03:00:00+03:00    17.541667
    2018-09-01 04:00:00+03:00    15.800000
    2018-09-01 05:00:00+03:00    11.420000
    2018-09-01 06:00:00+03:00    11.614167
    2018-09-01 07:00:00+03:00    17.665000
    Freq: H, Name: P2, dtype: float64
    [8]:
    # Check your work
    assert isinstance(y, pd.Series), f"`y` should be a Series, not type {type(y)}"
    assert len(y) == 2928, f"`y` should have 2928 observations, not {len(y)}"
    assert y.isnull().sum() == 0
    ## Explore

    [9]:
    VimeoVideo("665851830", h="85f58bc92b", width=600)
    [9]:
    [10]:
    # y.corr(y)          ==> correlation of y with itself is 1
    # y.corr(y.shift(1)) ==> correlation of y with its one-hour lag is about 0.65
    # y.corr(y.shift(2)) ==> correlation of y with its two-hour lag is about 0.4
    # y.corr(y.shift(3)) ==> correlation of y with its three-hour lag is about 0.3
    # We can read the same correlations off an ACF plot
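The shift-and-correlate idea above can be checked on synthetic data: in an AR(1) process with coefficient 0.65, the lag-k autocorrelation should be roughly 0.65**k, and `Series.autocorr(lag=k)` is the same quantity as `y.corr(y.shift(k))`. A sketch with an invented series:

```python
import numpy as np
import pandas as pd

# Simulate an AR(1) process: y_t = 0.65 * y_{t-1} + noise
rng = np.random.default_rng(42)
values = [0.0]
for _ in range(2000):
    values.append(0.65 * values[-1] + rng.normal())
y_sim = pd.Series(values)

# Series.autocorr(lag=k) computes y.corr(y.shift(k))
lag1 = y_sim.autocorr(lag=1)
lag2 = y_sim.autocorr(lag=2)
print(round(lag1, 2))  # close to 0.65
print(round(lag2, 2))  # close to 0.65 ** 2, i.e. about 0.42
```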
    **Task 3.3.4:** Create an ACF plot for the data in `y`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`.

    • What's an ACF plot?
    • Create an ACF plot using statsmodels
    [11]:
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_acf(y,ax=ax)
    plt.xlabel("Lag [hours]")
    plt.ylabel("Correlation Coefficient");
    [12]:
    VimeoVideo("665851811", h="ee3a2b5c24", width=600)
    [12]:
    **Task 3.3.5:** Create a PACF plot for the data in `y`. Be sure to label the x-axis as `"Lag [hours]"` and the y-axis as `"Correlation Coefficient"`.

    • What's a PACF plot?
    • Create a PACF plot using statsmodels
    [13]:
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_pacf(y,ax=ax)
    plt.xlabel("Lag [hours]")
    plt.ylabel("Correlation Coefficient");
    ## Split

    [14]:
    VimeoVideo("665851798", h="6c191cd94c", width=600)
    [14]:
    **Task 3.3.6:** Split `y` into training and test sets. The first 95% of the data should be in your training set. The remaining 5% should be in the test set.

    • Divide data into training and test sets in pandas.
    [15]:
    cutoff_test = int(len(y)*0.95)
    ​
    y_train = y.iloc[:cutoff_test]
    y_test = y.iloc[cutoff_test:]
    [16]:
    len(y_train)+len(y_test)==len(y)
    [16]:
    True
    # Build Model

    ## Baseline

    **Task 3.3.7:** Calculate the baseline mean absolute error for your model.

    • Calculate summary statistics for a DataFrame or Series in pandas.
    [17]:
    y_train_mean = y_train.mean()
    y_pred_baseline = [y_train_mean] * len(y_train)
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    ​
    print("Mean P2 Reading:", round(y_train_mean, 2))
    print("Baseline MAE:", round(mae_baseline, 2))
    Mean P2 Reading: 9.22
    Baseline MAE: 3.71
    
    ## Iterate

    [18]:
    VimeoVideo("665851769", h="94a4296cde", width=600)
    [18]:
    **Task 3.3.8:** Instantiate an [`AutoReg`](https://www.statsmodels.org/stable/generated/statsmodels.tsa.ar_model.AutoReg.html) model and fit it to the training data `y_train`. Be sure to set the `lags` argument to `26`.

    • What's an AR model?
    • Instantiate a predictor in statsmodels.
    • Train a model in statsmodels.
    [19]:
    model = AutoReg(y_train,lags=26).fit()
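Under the hood, `AutoReg` solves an ordinary least-squares problem: each observation is regressed on its previous `lags` values plus a constant. A minimal NumPy sketch of that idea, fitting an AR(2) to synthetic data (an illustration of the principle, not the statsmodels implementation):

```python
import numpy as np

# Simulate an AR(2) process: y_t = 5 + 0.5*y_{t-1} + 0.2*y_{t-2} + noise
rng = np.random.default_rng(0)
y_sim = np.zeros(3000)
for t in range(2, len(y_sim)):
    y_sim[t] = 5 + 0.5 * y_sim[t - 1] + 0.2 * y_sim[t - 2] + rng.normal()

# Design matrix: a constant column plus one column per lag
lags = 2
X = np.column_stack(
    [np.ones(len(y_sim) - lags)]
    + [y_sim[lags - k : len(y_sim) - k] for k in range(1, lags + 1)]
)
target = y_sim[lags:]

# Ordinary least squares recovers the intercept and lag coefficients
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
intercept, phi1, phi2 = coef
print(round(phi1, 1), round(phi2, 1))  # roughly the true 0.5 and 0.2
```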
    [20]:
    VimeoVideo("665851746", h="1a4511e883", width=600)
    [20]:
    **Task 3.3.9:** Generate a list of training predictions for your model and use them to calculate your training mean absolute error.

    • Generate in-sample predictions for a model in statsmodels.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [21]:
    y_pred = model.predict().dropna()
    training_mae = mean_absolute_error(y_train.iloc[26:],y_pred)
    print("Training MAE:", training_mae)
    Training MAE: 2.2809871656467036
    
    [22]:
    VimeoVideo("665851744", h="60d053b455", width=600)
    [22]:
    **Task 3.3.10:** Use `y_train` and `y_pred` to calculate the residuals for your model.

    • What's a residual?
    • Create new columns derived from existing columns in a DataFrame using pandas.
    [23]:
    # y_train_resid = y_train - y_pred  # gives the same result as model.resid below
    y_train_resid = model.resid
    y_train_resid.tail()
    [23]:
    timestamp
    2018-12-25 19:00:00+03:00   -0.392002
    2018-12-25 20:00:00+03:00   -1.573180
    2018-12-25 21:00:00+03:00   -0.735747
    2018-12-25 22:00:00+03:00   -2.022221
    2018-12-25 23:00:00+03:00   -0.061916
    Freq: H, dtype: float64
    [24]:
    VimeoVideo("665851712", h="9ff0cdba9c", width=600)
    [24]:
    **Task 3.3.11:** Create a plot of `y_train_resid`.

    • Create a line plot using pandas.
    [25]:
    fig, ax = plt.subplots(figsize=(15, 6))
    y_train_resid.plot(ylabel="Residual Value",ax=ax)
    [25]:
    <AxesSubplot:xlabel='timestamp', ylabel='Residual Value'>
    [26]:
    VimeoVideo("665851702", h="b494adc297", width=600)
    [26]:
    **Task 3.3.12:** Create a histogram of `y_train_resid`.

    • Create a histogram using plotly express.
    [27]:
    y_train_resid.hist()
    plt.xlabel("Residual Value")
    plt.ylabel("Frequency")
    plt.title("AR(26),Distribution of Residuals")
    [27]:
    Text(0.5, 1.0, 'AR(26),Distribution of Residuals')
    [28]:
    VimeoVideo("665851684", h="d6d782a1f3", width=600)
    [28]:
    **Task 3.3.13:** Create an ACF plot of `y_train_resid`.

    • What's an ACF plot?
    • Create an ACF plot using statsmodels
    [29]:
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_acf(y_train_resid,ax=ax)
    [29]:
    ## Evaluate

    [30]:
    VimeoVideo("665851662", h="72e767e121", width=600)
    [30]:
    [31]:
    # # Method shown in the video for evaluating on the test data:
    # y_test.index.min()  # first timestamp in the test data
    # y_test.index.max()  # last timestamp in the test data
    # y_pred_test = model.predict(y_test.index.min(), y_test.index.max()).dropna()
    # test_mae = mean_absolute_error(y_test, y_pred_test)
    # print("Test MAE:", test_mae)
    [32]:
    # Alternative used here: refit the model on the test data itself
    # (note this is not a true out-of-sample evaluation)
    model = AutoReg(y_test, lags=26).fit()
    **Task 3.3.14:** Calculate the test mean absolute error for your model.

    • Generate out-of-sample predictions using model in statsmodels.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [33]:
    y_pred_test = model.predict().dropna()
    test_mae = mean_absolute_error(y_test.iloc[26:],y_pred_test)
    print("Test MAE:", test_mae)
    Test MAE: 1.1581601912057071
    
    **Task 3.3.15:** Create a DataFrame `test_predictions` that has two columns: `"y_test"` and `"y_pred"`. The first should contain the true values for your test set, and the second should contain your model's predictions. Be sure the index of `test_predictions` matches the index of `y_test`.

    • Create a DataFrame from a dictionary using pandas.
    [34]:
    df_pred_test = pd.DataFrame(
        {"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index
    )
    [35]:
    VimeoVideo("665851628", h="29b43e482e", width=600)
    [35]:
    **Task 3.3.16:** Create a time series plot for the values in `test_predictions` using plotly express. Be sure that the y-axis is properly labeled as `"P2"`.

    • Create a line plot in plotly express.
    [36]:
    fig = px.line(df_pred_test, labels={"value": "P2"})
    fig.show()
    [37]:
    VimeoVideo("665851599", h="bb30d96e43", width=600)
    [37]:
    **Task 3.3.17:** Perform walk-forward validation for your model for the entire test set `y_test`. Store your model's predictions in the Series `y_pred_wfv`.

    • What's walk-forward validation?
    • Perform walk-forward validation for time series model.
    [38]:
    %%capture
    ​
    y_pred_wfv = pd.Series(dtype="float64")
    history = y_train.copy()
    for i in range(len(y_test)):
        # Refit on everything observed so far, then forecast one step ahead
        model = AutoReg(history, lags=26).fit()
        next_pred = model.forecast()
        y_pred_wfv = y_pred_wfv.append(next_pred)
        # Walk forward: fold the true test observation into the history
        history = history.append(y_test[next_pred.index])
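One caveat: `Series.append` was removed in pandas 2.0. The same walk-forward pattern can be written with `pd.concat`; the sketch below swaps in a naive last-value forecaster so it runs without statsmodels, and the train/test values are invented for illustration:

```python
import pandas as pd

# Toy train/test split (values invented for illustration)
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
train, test = series.iloc[:4], series.iloc[4:]

history = train.copy()
preds = []
for timestamp in test.index:
    # Stand-in for AutoReg(history, lags=26).fit().forecast():
    # naively predict the next value as the last observed value
    next_pred = pd.Series([history.iloc[-1]], index=[timestamp])
    preds.append(next_pred)
    # Walk forward: fold the true observation into the history
    history = pd.concat([history, test.loc[[timestamp]]])

y_pred_wfv_demo = pd.concat(preds)
```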
    [39]:
    len(y_pred_wfv)==len(y_test)
    [39]:
    True
    [40]:
    history.tail(1)
    [40]:
    2019-01-01 02:00:00+03:00    18.803333
    Freq: H, Name: P2, dtype: float64
    [41]:
    model = AutoReg(history,lags=26).fit()
    [42]:
    model.forecast()
    [42]:
    2019-01-01 03:00:00+03:00    14.110356
    Freq: H, dtype: float64
    [43]:
    y_test.head(1)
    [43]:
    timestamp
    2018-12-26 00:00:00+03:00    5.679091
    Freq: H, Name: P2, dtype: float64
    [44]:
    VimeoVideo("665851568", h="a764ab5416", width=600)
    [44]:
    **Task 3.3.18:** Calculate the test mean absolute error for your model.

    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [45]:
    test_mae = mean_absolute_error(y_test,y_pred_wfv)
    print("Test MAE (walk forward validation):", round(test_mae, 2))
    Test MAE (walk forward validation): 1.4
    
    # Communicate Results

    [46]:
    VimeoVideo("665851553", h="46338036cc", width=600)
    [46]:
    **Task 3.3.19:** Print out the parameters for your trained model.

    • Access model parameters in statsmodels
    [47]:
    # Unlike scikit-learn, statsmodels exposes the fitted intercept and lag
    # coefficients directly through model.params
    print(model.params)
    intercept    2.021763
    P2.L1        0.587991
    P2.L2        0.019749
    P2.L3        0.023426
    P2.L4        0.026807
    P2.L5        0.044019
    P2.L6       -0.101889
    P2.L7        0.029814
    P2.L8        0.049894
    P2.L9       -0.017513
    P2.L10       0.032585
    P2.L11       0.064034
    P2.L12       0.005978
    P2.L13       0.018366
    P2.L14      -0.007634
    P2.L15      -0.016571
    P2.L16      -0.015587
    P2.L17      -0.035295
    P2.L18       0.000929
    P2.L19      -0.003956
    P2.L20      -0.020668
    P2.L21      -0.012730
    P2.L22       0.052431
    P2.L23       0.074405
    P2.L24      -0.024263
    P2.L25       0.090515
    P2.L26      -0.088646
    dtype: float64
    
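These parameters read as a regression equation: the forecast is the intercept plus each coefficient times the corresponding lagged observation. A sketch of that arithmetic with a hypothetical two-lag parameter set (the values below are invented, not the fitted ones above):

```python
# Hypothetical fitted parameters for a two-lag model
intercept = 2.0
coefs = [0.6, 0.1]    # weights on P2.L1 and P2.L2
recent = [9.0, 12.0]  # most recent observations, newest first

# One-step forecast: intercept + sum of coefficient * lagged value
forecast = intercept + sum(c * v for c, v in zip(coefs, recent))
print(forecast)  # 2.0 + 0.6*9.0 + 0.1*12.0 = 8.6
```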
    [48]:
    VimeoVideo("665851529", h="39284d37ac", width=600)
    [48]:
    **Task 3.3.20:** Put the values for `y_test` and `y_pred_wfv` into the DataFrame `df_pred_test` (don't forget the index). Then plot `df_pred_test` using plotly express.

    • Create a line plot in plotly express.
    [49]:
    df_pred_test = pd.DataFrame(
        {"y_test": y_test, "y_pred_wfv": y_pred_wfv}
    )
    fig = px.line(df_pred_test, labels={"value": "PM2.5"})
    fig.show()
    <font size="+3"><strong>3.2. Linear Regression with Time Series Data</strong></font>

    [1]:
    from sklearn.linear_model import LinearRegression
    [2]:
    VimeoVideo("665412117", h="c39a50bd58", width=600)
    [2]:
    # Prepare Data

    ## Import

    [3]:
    VimeoVideo("665412469", h="135f32c7da", width=600)
    [3]:
    **Task 3.2.1:** Complete the code below to create a client that connects to the MongoDB server, assign the `"air-quality"` database to `db`, and assign the `"nairobi"` collection to `nairobi`.

    • Create a client object for a MongoDB instance.
    • Access a database using PyMongo.
    • Access a collection in a database using PyMongo.
    [4]:
    client = MongoClient(host="localhost",port=27017)
    [5]:
    VimeoVideo("665412480", h="c20ed3e570", width=600)
    [5]:
    **Task 3.2.2:** Complete the `wrangle` function below so that the `results` from the database query are read into the DataFrame `df`. Be sure that the index of `df` is the `"timestamp"` from the results.

    • Create a DataFrame from a dictionary using pandas.
    [6]:
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    [7]:
    VimeoVideo("665412496", h="d757475f7c", width=600)
    [7]:
    **Task 3.2.3:** Use your wrangle function to read the data from the `nairobi` collection into the DataFrame `df`.

    [8]:
    df = wrangle(nairobi)
    df.head(10)
    [8]:
    P2 P2.L1
    timestamp
    2018-09-01 04:00:00+03:00 15.800000 17.541667
    2018-09-01 05:00:00+03:00 11.420000 15.800000
    2018-09-01 06:00:00+03:00 11.614167 11.420000
    2018-09-01 07:00:00+03:00 17.665000 11.614167
    2018-09-01 08:00:00+03:00 21.016667 17.665000
    2018-09-01 09:00:00+03:00 22.589167 21.016667
    2018-09-01 10:00:00+03:00 18.605833 22.589167
    2018-09-01 11:00:00+03:00 14.022500 18.605833
    2018-09-01 12:00:00+03:00 13.150000 14.022500
    2018-09-01 13:00:00+03:00 12.806667 13.150000
    [9]:
    assert any([isinstance(df, pd.DataFrame), isinstance(df, pd.Series)])
    [10]:
    VimeoVideo("665412520", h="e03eefff07", width=600)
    [10]:
    **Task 3.2.4:** Add to your `wrangle` function so that the `DatetimeIndex` for `df` is localized to the correct timezone, `"Africa/Nairobi"`. Don't forget to re-run all the cells above after you change the function.

    • Localize a timestamp to another timezone using pandas.
    [11]:
    # df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    [12]:
    assert df.index.tzinfo == pytz.timezone("Africa/Nairobi")
    ## Explore

    [13]:
    VimeoVideo("665412546", h="97792cb982", width=600)
    [13]:
    **Task 3.2.5:** Create a boxplot of the `"P2"` readings in `df`.

    • Create a boxplot using pandas.
    [14]:
    fig, ax = plt.subplots(figsize=(15, 6))
    df["P2"].plot(kind="box", vert=False, ax=ax)
    [14]:
    <AxesSubplot:>
    [15]:
    VimeoVideo("665412573", h="b46049021b", width=600)
    [15]:
    **Task 3.2.6:** Add to your `wrangle` function so that all `"P2"` readings above 500 are dropped from the dataset. Don't forget to re-run all the cells above after you change the function.

    • Subset a DataFrame with a mask using pandas.
    [16]:
    assert len(df) <= 32906
    [17]:
    VimeoVideo("665412594", h="e56c2f6839", width=600)
    [17]:
    **Task 3.2.7:** Create a time series plot of the `"P2"` readings in `df`.

    • Create a line plot using pandas.
    [18]:
    fig, ax = plt.subplots(figsize=(15, 6))
    df["P2"].plot(ax=ax)
    [19]:
    VimeoVideo("665412601", h="a16c5a73fc", width=600)
    [19]:
    **Task 3.2.8:** Add to your `wrangle` function to resample `df` to provide the mean `"P2"` reading for each hour. Use a forward fill to impute any missing values. Don't forget to re-run all the cells above after you change the function.

    • Resample time series data in pandas.
    • Impute missing time series values using pandas.
    [20]:
    # df["P2"].resample("1H").mean().fillna(method="ffill").to_frame() #.isnull().sum()
    [21]:
    assert len(df) <= 2928
    [22]:
    VimeoVideo("665412649", h="d2e99d2e75", width=600)
    [22]:
    **Task 3.2.9:** Plot the rolling average of the `"P2"` readings in `df`. Use a window size of `168` (the number of hours in a week).

    • What's a rolling window?
    • Do a rolling window calculation in pandas.
    • Make a line plot with time series data in pandas.
    [23]:
    fig, ax = plt.subplots(figsize=(15, 6))
    df["P2"].rolling(168).mean().plot(ax=ax);
    [24]:
    VimeoVideo("665412693", h="c3bca16aff", width=600)
    [24]:
    **Task 3.2.10:** Add to your `wrangle` function to create a column called `"P2.L1"` that contains the mean `"P2"` reading from the previous hour. Since this new feature will create `NaN` values in your DataFrame, be sure to also drop null rows from `df`.

    • Shift the index of a Series in pandas.
    • Drop rows with missing values from a DataFrame using pandas.
    [25]:
    assert len(df) <= 11686
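The lag feature itself comes from `Series.shift`. A minimal sketch with toy readings (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"P2": [17.5, 15.8, 11.4, 11.6]})

# shift(1) moves each reading down one row, so every row sees the previous hour
toy["P2.L1"] = toy["P2"].shift(1)

# The first row has no previous hour, so drop the resulting NaN row
toy = toy.dropna()
print(len(toy))  # 3
```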
    [26]:
    VimeoVideo("665412732", h="059e4088c5", width=600)
    [26]:
    **Task 3.2.11:** Create a correlation matrix for `df`.

    • Create a correlation matrix in pandas.
    [27]:
    df.corr()
    [27]:
    P2 P2.L1
    P2 1.000000 0.650679
    P2.L1 0.650679 1.000000
    [28]:
    VimeoVideo("665412741", h="7439cb107c", width=600)
    [28]:
    **Task 3.2.12:** Create a scatter plot that shows the mean PM 2.5 reading for each hour as a function of the mean reading from the previous hour. In other words, `"P2.L1"` should be on the x-axis, and `"P2"` should be on the y-axis. Don't forget to label your axes!

    • Create a scatter plot using Matplotlib.
    [29]:
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(x=df["P2.L1"], y=df["P2"])
    ax.set_xlabel("P2.L1")
    ax.set_ylabel("P2")
    ax.plot([0, 120], [0, 120], linestyle="--", color="orange")
    [29]:
    [<matplotlib.lines.Line2D at 0x7f7eae8ce940>]
    ## Split

    [30]:
    VimeoVideo("665412762", h="a5eba496f7", width=600)
    [30]:
    **Task 3.2.13:** Split the DataFrame `df` into the feature matrix `X` and the target vector `y`. Your target is `"P2"`.

    • Subset a DataFrame by selecting one or more columns in pandas.
    • Select a Series from a DataFrame in pandas.
    [35]:
    feature = ["P2.L1"]
    X, y = df[feature], df["P2"]
    X.head()
    [35]:
    P2.L1
    timestamp
    2018-09-01 04:00:00+03:00 17.541667
    2018-09-01 05:00:00+03:00 15.800000
    2018-09-01 06:00:00+03:00 11.420000
    2018-09-01 07:00:00+03:00 11.614167
    2018-09-01 08:00:00+03:00 17.665000
    [36]:
    VimeoVideo("665412785", h="03118eda71", width=600)
    [36]:
    **Task 3.2.14:** Split `X` and `y` into training and test sets. The first 80% of the data should be in your training set. The remaining 20% should be in the test set.

    • Divide data into training and test sets in pandas.
    [37]:
    cutoff = int(len(X) * 0.8)
    X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
    X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
    [38]:
    len(X_train)+len(X_test)==len(X)
    [38]:
    True
    # Build Model

    ## Baseline

    **Task 3.2.15:** Calculate the baseline mean absolute error for your model.

    • Calculate summary statistics for a DataFrame or Series in pandas.
    [45]:
    y_pred_baseline = [y_train.mean()] * len(y_train)
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    print("Mean P2 Reading:", round(y_train.mean(), 2))
    print("Baseline MAE:", round(mae_baseline, 2))
    Mean P2 Reading: 9.27
    Baseline MAE: 3.89
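    The baseline predicts the training-set mean for every observation; its MAE is the bar any model has to beat. A toy version of the calculation (the values are illustrative, not the lesson's readings):

    ```python
    import pandas as pd
    from sklearn.metrics import mean_absolute_error

    y_train = pd.Series([8.0, 10.0, 9.0, 11.0, 12.0])

    y_pred_baseline = [y_train.mean()] * len(y_train)  # always predict the mean
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    ```

    Here the mean is 10.0, the absolute errors are 2, 0, 1, 1, 2, and the baseline MAE works out to 1.2.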
    
    ## Iterate

    **Task 3.2.16:** Instantiate a [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model named `model`, and fit it to your training data.

    • Instantiate a predictor in scikit-learn.
    • Fit a model to training data in scikit-learn.
    [47]:
    model = LinearRegression()
    model.fit(X_train, y_train)
    [47]:
    LinearRegression()
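    On toy data generated from an exact linear rule, the fit recovers the slope and intercept; the lesson's model is fit the same way on its `X_train` and `y_train`:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Toy data following y = 0.5 * x + 2 exactly
    X_train = pd.DataFrame({"P2.L1": [2.0, 4.0, 6.0, 8.0]})
    y_train = pd.Series([3.0, 4.0, 5.0, 6.0])

    model = LinearRegression()
    model.fit(X_train, y_train)
    ```

    After fitting, `model.coef_` and `model.intercept_` hold the learned slope (0.5) and intercept (2.0), which is what Task 3.2.18 reads out.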
    ## Evaluate

    [48]:
    VimeoVideo("665412844", h="129865775d", width=600)
    [48]:
    **Task 3.2.17:** Calculate the training and test mean absolute error for your model.

    • Generate predictions using a trained model in scikit-learn.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [54]:
    training_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    print("Training MAE:", round(training_mae, 2))
    print("Test MAE:", round(test_mae, 2))
    Training MAE: 2.46
    Test MAE: 1.8
    
    # Communicate Results

    **Task 3.2.18:** Extract the intercept and coefficient from your `model`. 

    • Access an object in a pipeline in scikit-learn.
    [61]:
    intercept = round(model.intercept_, 2)
    coefficient = round(model.coef_[0], 2)
    print(f"P2 = {intercept} + ({coefficient} * P2.L1)")
    P2 = 3.36 + (0.64 * P2.L1)
    
    [56]:
    VimeoVideo("665412870", h="318d69683e", width=600)
    [56]:
    **Task 3.2.19:** Create a DataFrame `df_pred_test` that has two columns: `"y_test"` and `"y_pred"`. The first should contain the true values for your test set, and the second should contain your model's predictions. Be sure the index of `df_pred_test` matches the index of `y_test`.

    • Create a DataFrame from a dictionary using pandas.
    [62]:
    df_pred_test = pd.DataFrame({"y_test":y_test,"y_pred":model.predict(X_test)})
    [62]:
    y_test y_pred
    timestamp
    2018-12-07 17:00:00+03:00 7.070000 8.478927
    2018-12-07 18:00:00+03:00 8.968333 7.865485
    2018-12-07 19:00:00+03:00 11.630833 9.076421
    2018-12-07 20:00:00+03:00 11.525833 10.774814
    2018-12-07 21:00:00+03:00 9.533333 10.707836
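    Because the dictionary's `"y_test"` value is a Series, pandas reuses its timestamp index for the new DataFrame, which is what keeps `df_pred_test` aligned with `y_test`. A toy sketch, where the `y_pred` list stands in for `model.predict(X_test)`:

    ```python
    import pandas as pd

    idx = pd.to_datetime(["2018-12-07 17:00", "2018-12-07 18:00"])
    y_test = pd.Series([7.07, 8.97], index=idx, name="P2")
    y_pred = [8.48, 7.87]  # stand-in for model.predict(X_test)

    # Passing index=y_test.index makes the alignment explicit
    df_pred_test = pd.DataFrame({"y_test": y_test, "y_pred": y_pred}, index=y_test.index)
    ```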
    [63]:
    VimeoVideo("665412891", h="39d7356a26", width=600)
    [63]:
    **Task 3.2.20:** Create a time series line plot for the values in `df_pred_test` using plotly express. Be sure that the y-axis is properly labeled as `"P2"`.

    • Create a line plot using plotly express.
    [65]:
    fig = px.line(df_pred_test, labels={"value": "P2"})
    fig.show()
    ---

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
